WO2023006183A1 - Method for managing the storage of data segments on a storage device - Google Patents

Method for managing the storage of data segments on a storage device

Info

Publication number
WO2023006183A1
WO2023006183A1, PCT/EP2021/070951, EP2021070951W
Authority
WO
WIPO (PCT)
Prior art keywords
size
segments
data segments
data
segment
Prior art date
Application number
PCT/EP2021/070951
Other languages
French (fr)
Inventor
Elizabeth FIRMAN
Idan Zach
Assaf Natanzon
Aviv Kuvent
Yair Toaff
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2021/070951 priority Critical patent/WO2023006183A1/en
Publication of WO2023006183A1 publication Critical patent/WO2023006183A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 - Saving storage space on storage systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 - Organizing or formatting or addressing of data
    • G06F3/064 - Management of blocks
    • G06F3/0641 - De-duplication techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0668 - Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 - In-line storage system
    • G06F3/0673 - Single storage device
    • H - ELECTRICITY
    • H03 - ELECTRONIC CIRCUITRY
    • H03M - CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 - Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091 - Data deduplication
    • H03M7/3093 - Data deduplication using fixed length segments

Definitions

  • FIG. 1 is a block diagram of a storage device in accordance with an implementation of the disclosure;
  • FIG. 2 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments using a storage device in accordance with an implementation of the disclosure;
  • FIG. 3 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments of sizes 30 Kilobytes (KB), 30 KB, and 14 KB using a storage device in accordance with an implementation of the disclosure;
  • FIG. 4 is an exemplary diagram that illustrates random IOs of 32KB and re-segmentation results in accordance with an implementation of the disclosure;
  • FIGS. 5A and 5B are exemplary diagrams that illustrate random IO of 20KB and re-segmentation results in accordance with an implementation of the disclosure;
  • FIG. 6 is a flow diagram that illustrates a method for managing the storage of data segments on a storage device in accordance with an implementation of the disclosure; and
  • FIG. 7 is an illustration of a computing arrangement (e.g. a storage device) that is used in accordance with implementations of the disclosure.
  • Implementations of the disclosure provide a method for managing the storage of data segments on a storage device, and the storage device that is configured to manage the storage of data segments.
  • a process, a method, a system, a product, or a device that includes a series of steps or units is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
  • FIG. 1 is a block diagram of a storage device 100 in accordance with an implementation of the disclosure.
  • the storage device 100 is configured to store one or more fixed-size data segments 102A-N of a determined fixed size.
  • the storage device 100 is configured to receive an input data stream.
  • the storage device 100 is configured to segment the input data stream into one or more variable size data segments each having a variable size.
  • the storage device 100 is configured to create the offset data segments from the one or more variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size.
  • the storage device 100 is configured to identify duplicate data in the input data stream by detecting the duplicate data in the offset data segments.
  • An efficiency of the storage device 100 is improved by deduplicating the data segments without changing underlying data.
  • the storage device 100 writes the offset data segments and aligns them according to their variable size to obtain the fixed-size data segments. Once the offset data segments are written, the storage device 100 detects duplicates of the data in the fixed-size data segments.
  • the deduplication in the one or more fixed-size data segments 102A-N is enhanced by writing the offset segments, and not the actual segments.
  • the offset segments enhance the de-duplication ratio with minimal overhead.
  • the storage device 100 provides variable size segmentation of the input data stream to gain flexibility to shift in the input data stream while using the one or more fixed-size data segments 102A-N.
  • the storage device 100 provides offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized.
  • the storage device 100 is further configured to segment the input data stream, using as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream.
  • the storage device 100 is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value.
  • the storage device 100 is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
  • the storage device 100 is further configured to insert one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
  • the storage device 100 is further configured so that the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments, depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments.
  • the storage device 100 further includes a repository for storing the offset data segment.
  • the storage device 100 is further configured to store, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment relative to a maximum data segment size in the input data stream, and the size of the given variable size segment.
  • the storage device 100 is further configured to create the offset data segments from the variable size data segments by padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
  • FIG. 2 is an exemplary diagram that illustrates segmenting an input data stream 202 into one or more variable size data segments 206 using a storage device in accordance with an implementation of the disclosure.
  • the storage device is configured to store one or more fixed-size data segments 204A-N of a determined fixed size.
  • the storage device is configured to receive the input data stream 202.
  • the storage device is configured to segment the input data stream 202 into one or more variable size data segments 206 each having a variable size.
  • the storage device is configured to create the offset data segments 208A-B from the one or more variable size data segments 206 by offsetting the end of each variable size segment so that the size of each of the offset data segments 208A-B is a multiple of the determined fixed size.
  • the offset data segments 208A-B are shown with dots in the figure.
  • the one or more variable size data segments 206 are shown with diagonal lines in the figure.
  • the storage device segments the input data stream 202 into the one or more variable size data segments 206 based on a rolling hash and a central processing unit (CPU) optimized segmentation.
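  • As an illustration of such rolling-hash-based segmentation, the sketch below shows a generic content-defined chunking routine. The disclosure does not specify the hash function or the size parameters, so the window length, the boundary mask, and the 8/16/32 KB sizes used here are assumptions chosen only for the example.

```python
def segment_stream(data: bytes,
                   min_size: int = 8 * 1024,
                   avg_size: int = 16 * 1024,
                   max_size: int = 32 * 1024,
                   window: int = 48) -> list[bytes]:
    """Split data into variable size segments using a simple polynomial rolling hash."""
    mask = avg_size - 1                      # avg_size is assumed to be a power of two
    base, mod = 257, (1 << 61) - 1
    top = pow(base, window, mod)             # weight of the byte leaving the window

    segments, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * base + byte) % mod                      # slide the new byte into the window
        if i - start >= window:
            h = (h - data[i - window] * top) % mod       # drop the byte that left the window
        length = i - start + 1
        if length >= min_size and ((h & mask) == mask or length >= max_size):
            segments.append(data[start:i + 1])           # content-defined (or forced) boundary
            start, h = i + 1, 0
    if start < len(data):
        segments.append(data[start:])                    # tail shorter than min_size
    return segments
```

  • Because a boundary is declared wherever the low bits of the rolling hash match a fixed pattern, identical content tends to produce identical segment boundaries even when it is shifted within the stream, which is what gives the variable size segmentation its flexibility to shift while still using fixed-size data segments.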
  • a determined fixed size of the one or more fixed-size data segments 204A-N varies between different storage devices, and is also configurable based on a file system used in the storage device.
  • the storage device is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value.
  • the storage device is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
  • the storage device aligns the one or more variable size data segments 206 to the determined fixed size, adds a padding character, and saves additional metadata to enable reads, thereby improving a de-duplication ratio.
  • the storage efficiency of the storage device is increased and the cost of storage is reduced.
  • when random IOs are received for an existing file (e.g. a live file), the storage device (i) performs variable size segmentation for a large write that is received as the random IOs, and (ii) for a small write, reads enough data around the location where the new data is to be written and performs variable size segmentation with the new data to maintain consistency with the variable size segmentation. As there might be more data segments to read and write, enough space is required for the new aligned variable size segments.
  • variable size segmentation is performed, in case of a large input data stream, to obtain the one or more fixed-size data segments 204A-N to maintain consistency with the variable size segmentation.
  • the de-duplication ratio remains the same.
  • the process in which the storage device creates new data segments on a live file is known as re-segmentation.
  • a data management model is used to optimize a reserved space and minimize overhead of the data to be read to achieve efficient re-segmentation.
  • the re-segmentation may be used in online or offline deduplication of data segments based on a workload. During the re-segmentation, input-output operations (IOs) and computational overhead are added; to perform the re-segmentation offline, the data is written in aligned data segments and a bitmap of the changed data segments is saved.
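  • A minimal sketch of such a bitmap of changed data segments is given below. It assumes one bit per fixed-size grain of a hypothetical repository file; the class, the 32 KB grain size, and the method names are illustrative and are not taken from the disclosure.

```python
class DirtyGrainBitmap:
    """Tracks which fixed-size grains of a repository file were touched by random IOs,
    so that an offline pass can later re-segment only those regions."""

    def __init__(self, file_size: int, grain_size: int = 32 * 1024):
        self.grain_size = grain_size
        n_grains = (file_size + grain_size - 1) // grain_size
        self.bits = bytearray((n_grains + 7) // 8)

    def mark_write(self, offset: int, length: int) -> None:
        """Record that the byte range [offset, offset + length) was overwritten."""
        first = offset // self.grain_size
        last = (offset + length - 1) // self.grain_size
        for grain in range(first, last + 1):
            self.bits[grain // 8] |= 1 << (grain % 8)

    def dirty_grains(self):
        """Yield grain indices that the offline re-segmentation should revisit."""
        for grain in range(len(self.bits) * 8):
            if self.bits[grain // 8] & (1 << (grain % 8)):
                yield grain


# Example: a 32 KB random write at offset 48 KB touches grains 1 and 2.
bitmap = DirtyGrainBitmap(file_size=1024 * 1024)
bitmap.mark_write(48 * 1024, 32 * 1024)
print(list(bitmap.dirty_grains()))   # [1, 2]
```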
  • the storage device includes a repository for storing the offset data segments 208A-B.
  • each user file is mapped to a file in the repository.
  • the file in the repository includes the offset data segments 208A-B, the padding character that is added, and a reserved space. The reserved space may be used for future re-segmentation using sparse files.
  • the metadata enables mapping the offset data segments 208A-B to the one or more fixed-size data segments 204A-N by considering the padding character.
  • the input data stream 202 may be divided into logical data segments. Each logical data segment has a maximum size. The maximum size of the fixed-size data segment is a predefined constant value, and it is important as it limits the number of reserved spaces.
  • the metadata is stored for each fixed-size data segment.
  • the metadata includes an offset and a size. The offset is relative to the offset of the fixed logical data segments of the input data stream 202 that are mapped to the one or more fixed-size data segments 204A-N. The offset of each fixed logical data segment is between 0 and 1.
  • the size of data of the one or more fixed-size data segments 204A-N is determined by mapping each fixed logical data segment to a constant number of the one or more fixed-size data segments 204A-N.
  • the size of the one or more fixed-size data segments 204A-N may be greater than 1.
  • the offset of the input data stream 202 is mapped to the offset of the repository using the metadata.
  • a fixed-size data segment 204A from which the reading operation starts is determined by (i) providing a start offset of the read that is relative to a user view of the fixed-size data segment 204A and determining a start offset of the read that is relative to a repository view, (ii) calculating an index of the fixed-size data segment 204A by dividing the start offset by the size of the fixed-size data segment 204A, (iii) searching for the relative offset in the corresponding one or more fixed-size data segments 204A-N, (iv) searching the metadata of the corresponding fixed-size data segments by providing the index of the fixed-size data segment 204A, and (v) identifying the fixed-size data segment 204A from which the reading operation starts.
  • the search of metadata of the corresponding fixed-size data segments includes a last fixed-size data segment from a previous fixed-size data segment. Once the start offset is identified, the metadata is retrieved from that point to read the corresponding fixed- size data segments.
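  • The sketch below illustrates this offset translation in a simplified form. It keeps, for each fixed-size data segment, only the number of user bytes actually stored in it and walks the metadata linearly; the procedure described above instead starts from an index computed by dividing the offset by the fixed size, which is omitted here for brevity. The structure and names are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GrainMeta:
    # The disclosure stores, per fixed-size data segment, an offset and a size;
    # this simplified sketch only needs the size (real user bytes held by the grain).
    size: int


def find_read_start(user_offset: int, metadata: List[GrainMeta]) -> Tuple[int, int]:
    """Return (grain_index, offset_inside_grain) where a read at user_offset begins."""
    covered = 0                                    # user bytes accounted for so far
    for grain_index, meta in enumerate(metadata):  # walk grains in repository order
        if user_offset < covered + meta.size:      # the requested byte lives in this grain
            return grain_index, user_offset - covered
        covered += meta.size
    raise ValueError("offset is beyond the end of the file")


# Example: grains holding 20 KB, 16 KB and 8 KB of user data; a read at user offset 30 KB
# starts inside the second grain, 10 KB past the beginning of its data.
print(find_read_start(30 * 1024,
                      [GrainMeta(20 * 1024), GrainMeta(16 * 1024), GrainMeta(8 * 1024)]))
```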
  • the segmentation is performed on the input data stream 202 that is appended to leftover from previous segmentation.
  • the leftover from the previous segmentation is used to calculate the boundaries of data segments.
  • the de-duplication ratio is higher, especially in the case of the small-sized input data stream.
  • a ratio between maximum segment size and minimum segment size is considered as k.
  • For each write operation, segmentation may be performed for new files and re-segmentation may be performed for live files.
  • new data is to be written after performing the read operation on the fixed-size data segment 204A.
  • an offset that is relative to a user view is translated to a repository offset, and the relative offset within the fixed-size data segment 204A is saved to the metadata of the one or more fixed-size data segments 204A-N.
  • the size of data in the one or more fixed-size data segments 204A-N is saved.
  • the one or more fixed-size data segments 204A-N are reserved for re-segmentation by starting the write operation for each data segment at the corresponding fixed-size data segment that is k times of the one or more fixed-size data segments 204A-N.
  • the one or more fixed-size data segments 204A-N represent a present data segment that may include data from a next data segment to maintain a data segment alignment, not affecting the de-duplication ratio as there is no deletion of any data segment during alignment.
  • the write operation may depend on the size of the one or more fixed-size data segments 204A-N compared to the size of the one or more variable size data segments 206.
  • each data segment is mapped to 2k times of the fixed-size data segments 204A-N.
  • the end of the variable size data segment is padded with a padding character to align with the size of the fixed-size data segment.
  • Each variable size data segment may be mapped to k times of the one or more fixed- size data segments 204A-N.
  • each data segment of the input data stream 202 in the repository is mapped to k times of the one or more fixed-sized data segments 204A-N, including data and reserved space.
  • each variable size data segment is mapped to at most k times of the one or more fixed-size data segments 204A-N and at least to a single fixed-size data segment.
  • the end of the data segment is padded with the padding character to align with the size of the one or more fixed-size data segments 204A-N.
  • Each variable size data segment may be mapped to 2k times of the one or more fixed-size data segments 204A-N.
  • the repository file includes 2k times of the one or more fixed-size data segments 204A-N, including the data and the reserved space.
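  • The grain accounting above can be sketched as follows, assuming that the fixed-size grain equals the minimum segment size and that k is the ratio between the maximum and the minimum segment sizes; the function name and the 32 KB / k = 4 values are illustrative only.

```python
import math


def grains_for_segment(segment_size: int,
                       grain_size: int = 32 * 1024,
                       k: int = 4) -> tuple[int, int, int]:
    """Return (data_grains, padding_bytes, reserved_grains) for one variable size segment.

    Segments are assumed to be at most k * grain_size bytes, so data_grains never exceeds k.
    """
    data_grains = max(1, math.ceil(segment_size / grain_size))   # at least one grain
    padding_bytes = data_grains * grain_size - segment_size      # pad the tail to a grain multiple
    reserved_grains = max(0, k - data_grains)                    # room left for re-segmentation
    return data_grains, padding_bytes, reserved_grains


# Example: a 30 KB segment with 32 KB grains and k = 4 occupies one data grain,
# carries 2 KB of (highly compressible) padding, and leaves three grains reserved.
print(grains_for_segment(30 * 1024))   # (1, 2048, 3)
```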
  • an upper bound is set.
  • the upper bound of the random IO overhead is a single fixed-size data segment after the last modified fixed-size data segment; that is, three extra data IOs are present: one read and two writes, along with one delete.
  • the upper bound of the random IO overhead is the single data segment after the last modified fixed-size data segment, i.e. the number of IOs depends on the specific data segment span.
  • FIG. 3 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments (e.g. 4 variable size data segments) of sizes 30 Kilobytes (KB), 30 KB, and 14 KB using a storage device in accordance with an implementation of the disclosure.
  • the exemplary diagram shows mapping of 2 variable size data segments to 2 fixed-size data segments of size 128 Kilobytes (KB) each.
  • Each fixed-size data segment includes 4 fixed-size data segment grains.
  • One of the 2 fixed-size data segments represents the start offsets in a range of 0-128 Kilobytes (KB), and
  • another fixed-size data segment represents the start offsets in a range of 128-256KB.
  • FIG. 4 is an exemplary diagram that illustrates random IOs of 32KB and re-segmentation results in accordance with an implementation of the disclosure.
  • the exemplary diagram includes writing new data of size 32KB at offset 16KB, which results in 5 data segments of sizes 20KB, 16KB, 8KB, 18KB, and 14KB using re-segmentation.
  • the re-segmentation includes calculating fixed-size data segments to read that are stored in a repository.
  • the exemplary diagram illustrates that writing starts at index 0 and ends at index 3, as shown in FIG. 3.
  • the fixed-size data segments in the range [0,3] are read.
  • the exemplary diagram includes performing re-segmentation on the fixed-size data segments with new data.
  • the last boundary of the fixed-size data segments is aligned to the previous fixed-size data segments. For example, the last boundary is currently aligned between 4 and 5, and was previously between 2 and 3, as shown in the figure.
  • the following mapping is performed based on the segmentation algorithm.
  • FIGS. 5A and 5B are exemplary diagrams that illustrate random IO of 20KB and re-segmentation results in accordance with an implementation of the disclosure.
  • FIG. 5A is an exemplary diagram that illustrates the random IO of 20KB and one of the results of re-segmentation.
  • the exemplary diagrams include writing new data of size 32KB at offset 16KB, which results in final data segments of sizes 20KB, 16KB, 8KB, 18KB, 8KB, and 14KB using re-segmentation.
  • the data segments 1-4 are segmented based on a segmentation algorithm.
  • the fixed-size data segment 5 includes a data segment that is created using the re-segmentation algorithm.
  • the re-segmentation includes calculating fixed-size data segments to read that are stored in a repository.
  • the exemplary diagrams depict that writing starts at index 4 and ends at index 5.
  • the fixed-size data segments in the range [4,5] are read.
  • the exemplary diagram illustrates performing re-segmentation on the fixed-size data segments with new data.
  • the segmentation is performed on 26KB, and data segment 3 of size 18KB is obtained.
  • the data segment 3 is stored in the fixed-size data segment 4, and a leftover of size 8KB is identified with no boundary. As the leftover is equal to the minimum data segment size, it is stored in its own fixed-size data segment, that is, at index 6.
  • FIG. 5B is an exemplary diagram that illustrates the random IO of 20 KB and another result of the re-segmentation.
  • the exemplary diagram includes writing new data of size 32KB at offset 20KB, which results in final data segments of sizes 20KB, 16KB, 20KB, and 20KB using re-segmentation.
  • the fixed-size data segment 7 includes a data segment that is created using a re-segmentation algorithm.
  • the re-segmentation includes calculating fixed-size data segments to read that are stored in the repository.
  • the exemplary method of writing starts at index 4 and ends at index 5.
  • the fixed-size data segments in the range [4,5] are read.
  • the exemplary diagram includes performing re-segmentation on the fixed-size data segments with new data.
  • the segmentation is performed on 26KB, and data segment 3 of size 20KB is obtained.
  • the data segment 3 is stored in the fixed-size data segment 4, and a leftover of size 6KB is identified with no boundary. As the leftover is less than the minimum data segment size, the next data segment is read, that is, data segment 7 with a size of 14KB.
  • the leftover and the read data segment are appended into a single data segment of size 20KB (6KB + 14KB) and are stored in data segment 7.
  • the re-segmentation algorithm provides writing variable size data segments on top of existing fixed-size deduplication while optimizing the random IOs using reserved space, thereby increasing a de-duplication ratio and the efficiency of a storage device.
  • FIG. 6 is a flow diagram that illustrates a method for managing the storage of data segments on a storage device in accordance with an implementation of the disclosure.
  • at a step 602, one or more fixed-size data segments of a determined fixed size are stored on the storage device.
  • an input data stream is received.
  • the input data stream is segmented into one or more variable size data segments each having a variable size.
  • offset data segments are created from the variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size.
  • duplicate data is identified in the input data stream by comparing the offset data segments with the one or more fixed-size data segments.
  • the method improves an efficiency of the storage device by deduplicating the data segments without changing underlying data.
  • the method writes the offset data segments and aligns them according to their variable size to obtain the one or more fixed-size data segments. Once the offset data segments are written, the method can detect duplicates of the data in the fixed-size data segments.
  • the deduplication in the one or more fixed-size data segments is enhanced by writing the offset segments and not the actual segments.
  • the offset segments enhance the deduplication ratio with minimal overhead.
  • the method provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the one or more fixed-size data segments.
  • the method provides the offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized.
  • a deduplication method may be implemented in scenarios such as in-place or overwrite modifications.
  • segmenting the input data stream uses as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream.
  • the maximum size of each of the variable size segments is set to be smaller or equal to the determined fixed size.
  • the minimum size of each of the variable size segments is set to be greater or equal to the determined fixed size.
  • the method further includes inserting one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
  • the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments.
  • the method further includes storing the offset data segment into a repository.
  • the method further includes storing, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream, and the size of the given variable size segment.
  • creating offset data segments from the variable size data segments includes padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
  • a ratio between maximum segment size and minimum segment size is considered as k.
  • MinSize: FSGrainSize (the minimal segment size)
  • Segment.UserOffset: the user offset at which the variable-length segment starts
  • the following code covers writing sequentially on the fixed-size data segments:
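  • A hedged Python sketch of such a sequential write is shown below; it is an illustration, not the original listing. Each variable size segment is padded up to a multiple of the fixed-size grain and written into its own slot of k grains, and the unused grains of the slot are left as reserved, sparse space. The constants, the SegmentMeta record, and the file layout are assumptions built around the parameter names listed above (FSGrainSize, MinSize, Segment.UserOffset).

```python
import math
from dataclasses import dataclass

FS_GRAIN_SIZE = 32 * 1024          # fixed-size grain; MinSize == FSGrainSize (assumed value)
MAX_SIZE = 128 * 1024              # maximum variable segment size (assumed value)
K = MAX_SIZE // FS_GRAIN_SIZE      # ratio between maximum and minimum segment sizes


@dataclass
class SegmentMeta:
    user_offset: int               # Segment.UserOffset: where the segment starts in the user file
    size: int                      # number of real (unpadded) bytes held by the segment


def write_sequential(repo_path: str, segments: list) -> list:
    """Write each variable size segment (at most MAX_SIZE bytes) into its own slot of K grains."""
    metadata, user_offset = [], 0
    with open(repo_path, "wb") as repo:
        for slot, seg in enumerate(segments):
            grains = max(1, math.ceil(len(seg) / FS_GRAIN_SIZE))
            padded = seg + b"\0" * (grains * FS_GRAIN_SIZE - len(seg))  # pad to a grain multiple
            repo.seek(slot * K * FS_GRAIN_SIZE)    # start of this segment's slot
            repo.write(padded)                     # unused grains of the slot stay sparse/reserved
            metadata.append(SegmentMeta(user_offset, len(seg)))
            user_offset += len(seg)
    return metadata
```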
  • the following code covers the re-segmentation algorithm:
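  • Again as a hedged illustration rather than the original listing, the sketch below re-reads only the segments overlapping a random write, splices in the new data, re-runs variable size segmentation on that span, and merges a leftover shorter than the minimum segment size into the following segment, mirroring the FIG. 5B example. segment_stream refers to the chunking sketch given earlier; the remaining names and the minimum size are assumptions.

```python
MIN_SIZE = 8 * 1024   # assumed minimum variable segment size


def resegment(segments: list, write_offset: int, new_data: bytes) -> list:
    """Apply a random write to an already segmented file and re-segment the affected span."""
    # 1. Locate the existing segments overlapping [write_offset, write_offset + len(new_data)).
    starts, pos = [], 0
    for seg in segments:
        starts.append(pos)
        pos += len(seg)
    end = write_offset + len(new_data)
    first = max(i for i, s in enumerate(starts) if s <= write_offset)
    last = max(i for i, s in enumerate(starts) if s < end)

    # 2. Read the affected span and splice the new data into it.
    span = bytearray(b"".join(segments[first:last + 1]))
    rel = write_offset - starts[first]
    span[rel:rel + len(new_data)] = new_data

    # 3. Re-run variable size segmentation on the modified span
    #    (segment_stream is the rolling-hash chunking sketch shown earlier).
    new_segs = segment_stream(bytes(span))

    # 4. A leftover shorter than the minimum segment size is merged with the next segment.
    tail = segments[last + 1:]
    if new_segs and tail and len(new_segs[-1]) < MIN_SIZE:
        new_segs[-1] = new_segs[-1] + tail[0]
        tail = tail[1:]

    return segments[:first] + new_segs + tail
```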
  • FIG. 7 is an illustration of an exemplary computing arrangement 700 (e.g. a storage device) in which the various architectures and functionalities of the various previous implementations may be implemented.
  • the computing arrangement 700 includes at least one processor 704 that is connected to a bus 702. The bus 702 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s).
  • the computing arrangement 700 also includes a memory 706. Control logic (software) and data are stored in the memory 706 which may take the form of random-access memory (RAM).
  • a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
  • the computing arrangement 700 may also include a secondary storage 710.
  • the secondary storage 710 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory.
  • the removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner.
  • Computer programs, or computer control logic algorithms may be stored in at least one of the memory 706 and the secondary storage 710. Such computer programs, when executed, enable the computing arrangement 700 to perform various functions as described in the foregoing.
  • the memory 706, the secondary storage 710, and any other storage are possible examples of computer-readable media.
  • the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 704, a graphics processor coupled to a communication interface 712, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 704 and a graphics processor, a chipset (i.e., a group of integrated circuits designed to work and be sold as a unit for performing related functions), etc.
  • the architectures and functionalities depicted in the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, or an application-specific system.
  • the computing arrangement 700 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, or an embedded system.
  • the computing arrangement 700 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, etc.
  • the computing arrangement 700 may be coupled to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 708.

Abstract

Provided is a method for managing the storage of data segments on a storage device (100). The method includes storing one or more fixed-size data segments (102A-N, 204A-N) of a determined fixed size on the storage device (100). The method includes receiving an input data stream (202). The method includes segmenting the input data stream (202) into one or more variable size data segments (206) each having a variable size. The method includes creating offset data segments (208A-B) from the variable size data segments (206) by offsetting the end of each variable size segment so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size. The method includes identifying duplicate data in the input data stream (202) by detecting the duplicate data in the offset data segments (208A-B).

Description

METHOD FOR MANAGING THE STORAGE OF DATA SEGMENTS ON A
STORAGE DEVICE
TECHNICAL FIELD
The disclosure relates to a method for managing the storage of data segments on a storage device, and more particularly, the disclosure relates to the storage device that is configured to manage the storage of data segments.
BACKGROUND
File system storage divides data of a file into fixed-size blocks, while a logical representation of the file points to the fixed-size blocks. Some existing storage systems optimize disk space by finding duplicates of the fixed-size blocks. Deduplication is a method that is used to reduce the amount of used storage by identifying granularity of duplicates of data and avoiding storing the duplicates of data explicitly. There are two methods to identify the granularity of data as a duplicate. The first method is used to divide the data into predefined fixed-size segments but misses duplicates of data that are aligned differently. The second method is used to divide the data into variable-size data segments by finding boundaries that depend on context and a criterion. However, the variable-size data segments consume more resources and computation time and are more complicated to implement.
Several known approaches are employed to perform variable size segmentation to store fixed-size chunks and to assign reference points to the fixed-size chunks. The reference points that are assigned to the fixed-size chunks include all the data or partial data (prefix or suffix) of that chunk. The existing approaches calculate fingerprints on variable size segments and search for matches in a fingerprint store to perform the deduplication method in backup streams. By saving a reference stream, the deduplication method saves a backup stream that points to the fixed size chunks to compose the variable size segments. This is done to handle misalignments between the variable size segments and the fixed-size chunks. Another known approach employs a method that uses hierarchical deduplication by storing fingerprints in a Merkle tree, where the leaf nodes represent fixed-size blocks. The hierarchical deduplication includes three levels. The first level includes segmenting the entire object into variable size segments and composing each variable size segment with the fixed-size blocks. For alignment, the last fixed-size block of each variable size segment can be smaller than the fixed size. The file systems that include the fixed size deduplication are aligned to the variable size segments using a segmentation technique and store additional metadata to identify the blocks smaller than the fixed size or to calculate segments that are multiples of the fixed-size blocks.
However, the aforementioned known approaches store additional fingerprint metadata that requires a lot of memory for fast searches. Moreover, the deduplication method is implemented either partially or entirely. Furthermore, the write-once read-many scenario is successful in implementing the deduplication method, whereas in-place or overwrite modifications are still unable to implement the deduplication method.
Therefore, there arises a need to address the aforementioned technical drawbacks in known techniques or technologies in managing the storage of data segments on a storage device.
SUMMARY
It is an object of the disclosure to provide a method for managing the storage of data segments on a storage device, and the storage device that is configured to manage the storage of data segments while avoiding one or more disadvantages of prior art approaches.
This object is achieved by the features of the independent claims. Further, implementation forms are apparent from the dependent claims, the description, and the figures.
The disclosure provides a method for managing the storage of data segments on a storage device, and the storage device that is configured to manage the storage of data segments.
According to a first aspect, there is provided a method for managing the storage of data segments on a storage device. The method includes storing a plurality of fixed-size data segments of a determined fixed size on the storage device. The method includes receiving an input data stream. The method includes segmenting the input data stream into a plurality of variable size data segments each having a variable size. The method includes creating offset data segments from the variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size. The method includes identifying duplicate data in the input data stream by detecting the duplicate data in the offset data segments.
The method improves an efficiency of the storage device by deduplicating the data segments without changing underlying data. The method writes the offset data segments and aligns them according to their variable size to obtain the fixed-size data segments. Once the offset data segments are written, the method can detect duplicates of the data in the fixed-size data segments. The deduplication in the fixed-size data segments is enhanced by writing the offset segments and not the actual segments. The offset segments enhance the deduplication ratio with minimal overhead. The method provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the fixed-size data segments. The method provides the offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized. A fixed-size data segment deduplication method may be implemented in scenarios such as in-place or overwrite modifications.
Optionally, segmenting the input data stream uses as parameters the average, the maximum, and the minimum sizes of each of the variable size segments created from segmenting the input data stream. The parameters related to the segmentation provide efficient deduplication in cases such as in-place or overwrite modifications.
Optionally, if the determined fixed size is greater than a first predetermined value, the maximum size of each of the variable size segments is set to be smaller or equal to the determined fixed size. Preferably, in that case, the maximum size of each of the variable size segments is set to be equal to the determined fixed size. Optionally too, if the determined fixed size is smaller than a second predetermined value, the minimum size of each of the variable size segments is set to be greater or equal to the determined fixed size. Preferably, in that case, the minimum size of each of the variable size segments is set to be equal to the determined fixed size. The determined fixed size provides a threshold value for applying the deduplication method. Also, the threshold value helps to optimize the usage of memory during deduplication. Optionally, the method further includes inserting one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments. The method provides writing variable size data segments on top of existing fixed-size deduplication, while optimizing the random IOs using reserved space, thereby increasing de-duplication ratio and the efficiency of the storage device.
Optionally, the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments, depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments.
Optionally, the method further includes storing the offset data segment into a repository.
The method optimizes the memory by inserting the offset data segments based on the variable size of data segments, thereby improving storage capacity of the storage device.
Optionally, the method further includes storing, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream relative to a maximum size of each of the variable size data segments, and the size of the given variable size data segment.
In the case of a large-sized input data stream, the memory is optimized by saving the relative offset, thereby improving the efficiency of the storage device. The relative offset is optionally an offset that is relative to the fixed size data segments that are divided from the input data stream.
Optionally, creating offset data segments from the variable size data segments comprises padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size. The padding character occupies a negligible space as it is highly compressible. Hence, the memory is utilized efficiently.
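As an illustration of the padding step and of why the padding is cheap, the following sketch pads a segment up to a multiple of the fixed size and compares compressed sizes with and without the padding; the 32 KB fixed size, the zero padding character, and the use of zlib are assumptions made only for the example.

```python
import zlib


def pad_to_multiple(segment: bytes, fixed_size: int, pad_char: bytes = b"\0") -> bytes:
    """Pad the end of a variable size segment so its length is a multiple of fixed_size."""
    padding = (fixed_size - len(segment) % fixed_size) % fixed_size
    return segment + pad_char * padding


segment = bytes(range(256)) * 100             # 25,600 bytes of example data
padded = pad_to_multiple(segment, 32 * 1024)  # padded up to the next 32 KB boundary
extra = len(zlib.compress(padded)) - len(zlib.compress(segment))
# About 7 KB of raw padding is added, but only a handful of extra bytes remain after compression.
print(len(padded) - len(segment), extra)
```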
According to a second aspect, there is provided a storage device. The storage device is configured to store a plurality of fixed-size data segments of a determined fixed size. The storage device is configured to receive an input data stream. The storage device is configured to segment the input data stream into a plurality of variable size data segments each having a variable size. The storage device is configured to create the offset data segments from the variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size. The storage device is configured to identify duplicate data in the input data stream by detecting the duplicate data in the offset data segments.
An efficiency of the storage device is improved by deduplicating the data segments without changing underlying data. The storage device writes the offset data segments and aligns them according to their variable size to obtain the fixed-size data segments. Once the offset data segments are written, the storage device detects duplicates of the data in the fixed-size data segments. The deduplication in the fixed-size data segments is enhanced by writing the offset segments and not the actual segments. The offset segments enhance the deduplication ratio with minimal overhead. The storage device provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the fixed-size data segments. The storage device provides the offset data segments as highly compressible so that the storage capacity occupied by the offset data segments is optimized. A deduplication method may be implemented in scenarios such as in-place or overwrite modifications.
Optionally, the storage device is further configured to segment the input data stream, using as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream. The storage device is configured to use the parameters that are related to the segmentation to provide efficient deduplication in cases such as in-place or overwrite modifications.
Optionally, the storage device is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value. The storage device provides the determined fixed size as a threshold value for applying the deduplication method. Also, the threshold value helps to optimize the usage of memory during deduplication.
Optionally, the storage device is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
Optionally, the storage device is further configured to insert one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments. Optionally, the storage device is further configured so that the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments. The storage device optimizes memory by inserting the offset data segments based on the variable size of data segments, thereby improving a storage capacity of the storage device.
Optionally, the storage device further includes a repository for storing the offset data segment.
Optionally, the storage device is further configured to store, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream relative to a maximum size of each of the variable size data segments, and the size of the given variable size data segment. In case of a large-sized input data stream, the storage device optimizes the memory by saving the relative offset, thereby improving the efficiency of the storage device. The relative offset is optionally an offset that is relative to the fixed size data segments that are divided from the input data stream.
Optionally, the storage device is further configured to create the offset data segments from the variable size data segments by padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
The padding character occupies a negligible space as it is highly compressible. Hence, the memory is utilized efficiently.
A technical problem in the prior art is resolved, where the technical problem concerns requiring a large amount of memory for fast searches due to the storage of additional fingerprint metadata, and where the deduplication method cannot be applied to in-place or overwrite modifications.
Therefore, in contradistinction to the prior art, the method for managing the storage of data segments on the storage device, and the storage device configured to manage the storage of data segments, deduplicate the data segments without changing underlying data, thereby improving the performance of the storage device. The storage device writes the offset data segments and aligns them according to their variable size data segments to obtain the fixed-size data segments. Once the offset data segments are written, the storage device can detect duplicates of the data in the fixed-size data segments. The deduplication in the fixed-size data segments is enhanced by writing the offset segments and not the actual segments. The offset segments enhance the deduplication ratio with minimal overhead. The storage device provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the fixed-size data segments. The storage device provides offset data segments as highly compressible so that a storage capacity occupied by the offset data segments is optimized. A deduplication method is implemented in scenarios such as in-place or overwrite modifications.
These and other aspects of the disclosure will be apparent from, and elucidated with reference to, the implementation(s) described below.
BRIEF DESCRIPTION OF DRAWINGS
Implementations of the disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a storage device in accordance with an implementation of the disclosure;
FIG. 2 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments using a storage device in accordance with an implementation of the disclosure;
FIG. 3 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments of sizes 30 Kilobytes (KB), 30 KB, and 14 KB using a storage device in accordance with an implementation of the disclosure;
FIG. 4 is an exemplary diagram that illustrates random IOs of 32KB and re-segmentation results in accordance with an implementation of the disclosure;
FIGS. 5A and 5B are exemplary diagrams that illustrate random IO of 20KB and re-segmentation results in accordance with an implementation of the disclosure;
FIG. 6 is a flow diagram that illustrates a method for managing the storage of data segments on a storage device in accordance with an implementation of the disclosure; and
FIG. 7 is an illustration of a computing arrangement (e.g. a storage device) that is used in accordance with implementations of the disclosure.
DETAILED DESCRIPTION OF THE DRAWINGS
Implementations of the disclosure provide a method for managing the storage of data segments on a storage device, and the storage device that is configured to manage the storage of data segments.
To make solutions of the disclosure more comprehensible for a person skilled in the art, the following implementations of the disclosure are described with reference to the accompanying drawings.
Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the disclosure are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the disclosure described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
FIG. 1 is a block diagram of a storage device 100 in accordance with an implementation of the disclosure. The storage device 100 is configured to store one or more fixed-size data segments 102A-N of a determined fixed size. The storage device 100 is configured to receive an input data stream. The storage device 100 is configured to segment the input data stream into one or more variable size data segments each having a variable size. The storage device 100 is configured to create the offset data segments from the one or more variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size. The storage device 100 is configured to identify duplicate data in the input data stream by detecting the duplicate data in the offset data segments.
An efficiency of the storage device 100 is improved by deduplicating the data segments without changing underlying data. The storage device 100 writes the offset data segments and aligns them according to their variable size to obtain the fixed-size data segments. Once the offset data segments are written, the storage device 100 detects duplicates of the data in the fixed-size data segments. The deduplication in the one or more fixed-size data segments 102A-N is enhanced by writing the offset segments, and not the actual segments. The offset segments enhance the de-duplication ratio with minimal overhead. The storage device 100 provides variable size segmentation of the input data stream to gain flexibility to shift in the input data stream while using the one or more fixed-size data segments 102A-N. The storage device 100 provides offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized.
Optionally, the storage device 100 is further configured to segment the input data stream, using as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream.
Optionally, the storage device 100 is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value.
Optionally, the storage device 100 is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
Optionally, the storage device 100 is further configured to insert one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
Optionally, the storage device 100 is further configured so that the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments, depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments. Optionally, the storage device 100 further includes a repository for storing the offset data segment.
Optionally, the storage device 100 is further configured to store, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment relative to a maximum data segment size in the input data stream, and the size of the given variable size segment.
Optionally, the storage device 100 is further configured to create the offset data segments from the variable size data segments by padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
FIG. 2 is an exemplary diagram that illustrates segmenting an input data stream 202 into one or more variable size data segments 206 using a storage device in accordance with an implementation of the disclosure. The storage device is configured to store one or more fixed-size data segments 204A-N of a determined fixed size. The storage device is configured to receive the input data stream 202. The storage device is configured to segment the input data stream 202 into one or more variable size data segments 206 each having a variable size. The storage device is configured to create the offset data segments 208A-B from the one or more variable size data segments 206 by offsetting the end of each variable size segment so that the size of each of the offset data segments 208A-B is a multiple of the determined fixed size. The offset data segments 208A-B are shown with dots in the figure. The one or more variable size data segments 206 are shown with diagonal lines in the figure.
The storage device segments the input data stream 202 into the one or more variable size data segments 206 based on the use of a rolling hash and a central processing unit (CPU)-optimized segmentation. Optionally, the determined fixed size of the one or more fixed-size data segments 204A-N varies between different storage devices, and is also configurable based on the file system used in the storage device.
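For purposes of illustration only, the following Python sketch shows one generic form of rolling-hash based, content-defined segmentation with minimum, average and maximum size parameters. The hash, the mask and the size values are assumptions made for this sketch and are not the specific segmentation of the disclosure.

MIN_SIZE = 8 * 1024      # assumed minimum variable segment size
AVG_SIZE = 16 * 1024     # assumed average variable segment size (power of two)
MAX_SIZE = 32 * 1024     # assumed maximum variable segment size
MASK = AVG_SIZE - 1      # a boundary is declared when (hash & MASK) == 0

def segment_stream(data: bytes):
    """Yield variable size segments whose boundaries depend on the content."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF      # toy rolling hash over the bytes
        size = i - start + 1
        if size < MIN_SIZE:
            continue                            # never cut below the minimum size
        if (h & MASK) == 0 or size >= MAX_SIZE:
            yield data[start:i + 1]             # content-defined or forced boundary
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]                      # leftover tail of the stream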
Optionally, the storage device is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value. Optionally, the storage device is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
Optionally, when large input outputs (IOs) to a new file are performed, the storage device aligns the one or more variable size data segments 206 to the determined fixed size, adds a padding character, and saves additional metadata to enable reads, thereby improving the de-duplication ratio. Hence, the storage efficiency of the storage device is increased and the cost of storage is reduced.
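As a minimal illustration of the alignment and padding referred to above, the following Python sketch rounds a variable size segment up to the next multiple of an assumed fixed grain size; the names pad_segment, FIXED_SIZE and PAD_CHAR are illustrative assumptions only.

FIXED_SIZE = 32 * 1024          # assumed fixed grain size (32 KB)
PAD_CHAR = b"\x00"              # assumed padding character; runs of it compress well

def pad_segment(segment: bytes, fixed_size: int = FIXED_SIZE) -> bytes:
    """Pad the end of a variable size segment to a multiple of the fixed size."""
    remainder = len(segment) % fixed_size
    if remainder == 0:
        return segment
    return segment + PAD_CHAR * (fixed_size - remainder)

# Example: a 30 KB segment becomes a 32 KB offset data segment.
assert len(pad_segment(b"x" * 30 * 1024)) == 32 * 1024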
Optionally, when random IOs are received to an existing file (e.g. a live file), the storage device (i) performs variable size segmentation for a large write that is received as the random IOs, and (ii) for a small write, reads enough data around the location where the new data is to be written and performs variable size segmentation with the new data to maintain consistency with the variable size segmentation. As there might be more data segments to read and write, enough space is required for the new aligned variable size segments.
Optionally, the variable size segmentation is performed, in the case of a large input data stream, to obtain the one or more fixed-size data segments 204A-N and to maintain consistency with the variable size segmentation. The de-duplication ratio remains the same. The creation of new data segments on a live file by the storage device is known as re-segmentation. Optionally, a data management model is used to optimize a reserved space and minimize the overhead of the data to be read, to achieve efficient re-segmentation. The re-segmentation may be used in online or offline deduplication of data segments based on a workload. During the re-segmentation, input-outputs (IOs) and computational overhead are added, thereby enabling writes in aligned data segments and saving a bitmap of changed data segments so that the re-segmentation can be performed offline.
Optionally, the storage device includes a repository for storing the offset data segments 208A-B. Optionally, each user file is mapped to a file in the repository. The file in the repository includes the offset data segments 208A-B, the padding character that is added, and a reserved space. The reserved space may be used for future re-segmentation using sparse files.
Optionally, for reading the input data stream 202, the metadata enables mapping the offset data segments 208A-B to the one or more fixed-size data segments 204A-N by considering the padding character. The input data stream 202 may be divided into logical data segments. Each logical data segment has a maximum size. The maximum size of the fixed-size data segment is a predefined constant value, and it is important as it limits the number of reserved spaces. Optionally, during a write operation, the metadata is stored for each fixed-size data segment. The metadata includes an offset and a size. The offset is relative to the offset of the fixed logical data segments of the input data stream 202 that are mapped to the one or more fixed-size data segments 204A-N. The offset of each fixed logical data segment is between 0 and 1. The size of data of the one or more fixed-size data segments 204A-N is determined by mapping each fixed logical data segment to a constant number of the one or more fixed-size data segments 204A-N. The size of the one or more fixed-size data segments 204A-N may be greater than 1. Optionally, the offset of the input data stream 202 is mapped to the offset of the repository using the metadata.
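For illustration, a per-segment metadata record of the kind described above can be sketched in Python as follows; the field names, the 32 KB logical segment size and the helper to_relative are assumptions of this sketch and not part of the disclosure.

from dataclasses import dataclass

LOGICAL_SEGMENT_SIZE = 32 * 1024   # assumed size of a fixed logical data segment

@dataclass
class SegmentMetadata:
    relative_offset: int           # offset inside the fixed logical data segment
    size: int                      # size of the variable size data segment

def to_relative(user_offset: int, logical_size: int = LOGICAL_SEGMENT_SIZE):
    """Split an absolute user offset into (logical segment index, relative offset)."""
    return user_offset // logical_size, user_offset % logical_size

# Example: a user offset of 60 KB falls into logical segment 1 at relative offset 28 KB.
print(to_relative(60 * 1024))      # (1, 28672)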
Optionally, during a read operation, the fixed-size data segment 204A from where the reading operation starts is determined by, (i) providing a start offset of the read that is relative to a user view of the fixed-size data segment 204A and determining a start offset of the read that is relative to a repository view, (ii) calculating an index of the fixed-size data segment 204A by dividing the start offset by the size of the fixed-size data segment 204A, (iii) searching the relative offset in the corresponding one or more fixed-size data segments 204A-N, (iv) searching the metadata of the corresponding fixed-size data segments by providing the index of the fixed-size data segment 204A, and (v) identifying the fixed-size data segment 204A from where the reading operation starts. Optionally, the search of the metadata of the corresponding fixed-size data segments includes a last fixed-size data segment from a previous fixed-size data segment. Once the start offset is identified, the metadata is retrieved from that point to read the corresponding fixed-size data segments.
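A much-simplified sketch of the read-side lookup is given below; the dictionary-based metadata, the backward walk to a previous segment and the use of sizes in KB are assumptions introduced only to illustrate the index calculation.

def find_read_start(start_offset_kb: int, fixed_size_kb: int, metadata: dict):
    """metadata maps a fixed-size data segment index to (relative_offset, size) in KB."""
    index = start_offset_kb // fixed_size_kb    # candidate fixed-size segment index
    relative = start_offset_kb % fixed_size_kb  # relative offset searched inside it
    # Fall back to a previous fixed-size data segment if the candidate has no entry.
    while index > 0 and index not in metadata:
        index -= 1
    return index, relative

# Example with assumed 32 KB fixed-size data segments and toy metadata.
md = {0: (0, 30), 3: (28, 30), 7: (28, 14)}
print(find_read_start(20, 32, md))              # (0, 20)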
Optionally, during the write operation, the segmentation is performed on the input data stream 202 that is appended to leftover from previous segmentation. The leftover from the previous segmentation is used to calculate the boundaries of data segments. Thus, for any size of a write operation, the de-duplication ratio is higher, especially in the case of the small-sized input data stream.
Optionally, a ratio between the maximum segment size and the minimum segment size is considered as k. For each write operation, segmentation may be performed for new files and re-segmentation may be performed for live files. During the write operation for the live files, new data is to be written after performing the read operation on the fixed-size data segment 204A. When writing the new data to the data segments, an offset that is relative to a user view is translated to a repository offset, and a relative offset in the fixed-size data segment 204A is saved to the metadata of the one or more fixed-size data segments 204A-N. The size of data in the one or more fixed-size data segments 204A-N is also saved. The one or more fixed-size data segments 204A-N are reserved for re-segmentation by starting the write operation for each data segment at the corresponding fixed-size data segment that is k times of the one or more fixed-size data segments 204A-N. The one or more fixed-size data segments 204A-N represent a present data segment that may include data from a next data segment to maintain a data segment alignment, without affecting the de-duplication ratio as no data segment is deleted during alignment. The write operation may depend on the size of the one or more fixed-size data segments 204A-N compared to the size of the one or more variable size data segments 206.
Optionally, when the size of the one or more fixed-size data segments 204A-N is less than or equal to the size of the one or more variable size data segments 206, each data segment is mapped to 2k times of the fixed-size data segments 204A-N. When a variable size data segment is smaller than the one or more fixed-size data segments 204A-N, the end of the variable size data segment is padded with a padding character to align with the size of the fixed-size data segment. Each variable size data segment may be mapped to k times of the one or more fixed-size data segments 204A-N. Hence, each data segment of the input data stream 202 in the repository is mapped to k times of the one or more fixed-size data segments 204A-N, including data and reserved space.
Optionally, when the size of the one or more fixed-size data segments 204A-N is greater than or equal to the variable size of the one or more variable size data segments 206, each variable size data segment is mapped to at most k times of the one or more fixed-size data segments 204A-N and at least to a single fixed-size data segment. When a variable size data segment is smaller than the fixed-size data segment, the end of the data segment is padded with the padding character to align with the size of the one or more fixed-size data segments 204A-N. Each variable size data segment may be mapped to two times of k times of the one or more fixed-size data segments 204A-N. Hence, for each size of the one or more variable size data segments 206, the repository file includes two times of k times of the one or more fixed-size data segments 204A-N, including the data and the reserved space.
Optionally, to initiate the re-segmentation process for the random IOs, an upper bound is set. When the size of the one or more fixed-size data segments 204A-N is less than or equal to the size of the one or more variable size data segments 206, the upper bound of the random IO overhead is a single fixed-size data segment after the last modified fixed-size data segment, that is, three extra data IOs are present: one read and two writes, along with one delete. When the size of the one or more fixed-size data segments 204A-N is greater than or equal to the size of the one or more variable size data segments 206, the upper bound of the random IO overhead is the single data segment after the last modified fixed-size data segment, i.e. the number of IOs depends on the specific data segment span.
FIG. 3 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments (e.g. 4 variable size data segments) of sizes 30 Kilobytes (KB), 30 KB, and 14 KB using a storage device in accordance with an implementation of the disclosure. The exemplary diagram shows the mapping of 2 variable size data segments to 2 fixed-size data segments of size 128 Kilobytes (KB) each. Each fixed-size data segment includes 4 fixed-size data segment grains. One of the 2 fixed-size data segments represents the start offsets in a range of 0-128 Kilobytes (KB), and the other fixed-size data segment represents the start offsets in a range of 128-256KB. Two reserved fixed-size data segment grains with index1 and index2, of size 32KB, are shown in the figure. Padding of 18KB is shown in the fixed-size data segment that represents the start offsets in the range 128-256KB. Optionally, a ratio between a maximum segment size and a minimum segment size is considered as k.
The following mapping is performed by considering k = 4 according to a segmentation algorithm, based on the following calculations:
FSBlockIndex = Int(Segment.UserOffset / FSGrainSize)
FSBlockGrainIndex = Int((Segment.UserOffset % FSGrainSize) / MinSize)
The new data segment is stored in FSGrain.Index = 4 * FSBlockIndex + FSBlockGrainIndex
Segment 1 - offset 0, size 30KB -> FS grain 0 (offset 0)
Location calculation:
FSBlockIndex = Int(0 / 32) = 0
FSBlockGrainIndex = Int((0 % 32) / 8) = 0
FSGrain.Index = 4 * 0 + 0 = 0
Stored MD: offset = 0, size = 30
Segment 2 - offset 28, size 30KB -> FS grain 3 (offset 96KB)
Location calculation:
FSBlockIndex = Int(28 / 32) = 0
FSBlockGrainIndex = Int((28 % 32) / 8) = 3
FSGrain.Index = 4 * 0 + 3 = 3
Stored MD: offset = 28, size = 30
Segment 3 - offset 60, size 14KB -> FS grain 7 (offset 224KB)
Location calculation:
FSBlockIndex = Int(60 / 32) = 1
FSBlockGrainIndex = Int((60 % 32) / 8) = 3
FSGrain.Index = 4 * 1 + 3 = 7
Stored MD: offset = 28, size = 14
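The location calculations above can be reproduced with the short Python sketch below (sizes in KB); the function name and the default grain and minimum sizes are taken from the example for illustration only.

def fs_grain_index(user_offset_kb: int, grain_kb: int = 32, min_kb: int = 8) -> int:
    """FSGrain.Index = 4 * FSBlockIndex + FSBlockGrainIndex, as in the example above."""
    fs_block_index = user_offset_kb // grain_kb
    fs_block_grain_index = (user_offset_kb % grain_kb) // min_kb
    return 4 * fs_block_index + fs_block_grain_index

for offset_kb, size_kb in [(0, 30), (28, 30), (60, 14)]:
    print(offset_kb, size_kb, "-> FS grain", fs_grain_index(offset_kb))
# Prints FS grains 0, 3 and 7, matching Segments 1-3 above.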
FIG. 4 is an exemplary diagram that illustrates random IOs of 32KB and re-segmentation results in accordance with an implementation of the disclosure. The exemplary diagram includes writing new data at offset 16KB, of size 32KB, which results in 5 data segments of sizes 20KB, 16KB, 8KB, 18KB, and 14KB using re-segmentation. The re-segmentation includes calculating the fixed-size data segments to read that are stored in a repository. The exemplary diagram illustrates that writing starts at index 0 and ends at index 3, as shown in FIG. 3. The fixed-size data segments in the range [0,3] are read. The exemplary diagram includes performing re-segmentation on the fixed-size data segments with the new data. The last boundary of the fixed-size data segments is aligned to the previous fixed-size data segments. For example, the last boundary is currently aligned between 4 and 5, and was previously between 2 and 3, as shown in the figure. The following mapping is performed based on the segmentation algorithm.
Segment 1 - offset 0, size 20KB -> FS grain 0 (offset 0)
Location calculation:
FSBlockIndex = Int(0 / 32) = 0
FSBlockGrainIndex = Int((0 % 32) / 8) = 0
FSGrain.Index = 4 * 0 + 0 = 0
Stored MD: offset = 0, size = 20
Segment 2 - offset 20, size 16KB -> FS grain 2 (offset 64KB)
Location calculation:
FSBlockIndex = Int(16 / 32) = 0
FSBlockGrainIndex = Int((16 % 32) / 8) = 2
FSGrain.Index = 4 * 0 + 2 = 2
Stored MD: offset = 20, size = 16
Segment 3 - offset 36, size 8KB -> FS grain 4 (offset 128KB)
Location calculation:
FSBlockIndex = Int(36 / 32) = 1
FSBlockGrainIndex = Int((36 % 32) / 8) = 0
FSGrain.Index = 4 * 1 + 0 = 4
Stored MD: offset = 4, size = 8
Segment 4 - offset 44, size 18KB -> FS grain 5 (offset 160KB)
Location calculation:
FSBlockIndex = Int(44 / 32) = 1
FSBlockGrainIndex = Int((44 % 32) / 8) = 1
FSGrain.Index = 4 * 1 + 1 = 5
Stored MD: offset = 12, size = 18
Segment 5 - offset 62, size 14KB -> FS grain 7 (offset 224KB)
Location calculation:
FSBlockIndex = Int(62 / 32) = 1
FSBlockGrainIndex = Int((62 % 32) / 8) = 3
FSGrain.Index = 4 * 1 + 3 = 7
Stored MD: offset = 30, size = 14
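The choice of which stored grains to read before re-segmentation can be illustrated with the following Python sketch; the tuple layout and the overlap test are assumptions of this sketch, with sizes in KB and the segment layout taken from FIG. 3.

segments = [      # (user_offset, size, fs_grain_index) as laid out in FIG. 3
    (0, 30, 0),
    (28, 30, 3),
    (60, 14, 7),
]

def grains_to_read(write_offset_kb: int, write_size_kb: int):
    """Return the first and last stored grain whose segment overlaps the write range."""
    end = write_offset_kb + write_size_kb
    hit = [g for (off, size, g) in segments if off < end and off + size > write_offset_kb]
    return min(hit), max(hit)

# A 32 KB write at offset 16 KB touches the segments stored in grains 0 and 3,
# so the fixed-size data segments in the range [0, 3] are read, as described above.
print(grains_to_read(16, 32))   # (0, 3)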
FIGS. 5A and 5B are exemplary diagrams that illustrate a random IO of 20KB and re-segmentation results in accordance with an implementation of the disclosure. FIG. 5A is an exemplary diagram that illustrates the random IO of 20KB and one of the results of re-segmentation. The exemplary diagrams include writing new data at offset 16KB, of size 32KB, which results in final data segments of sizes 20KB, 16KB, 8KB, 18KB, 8KB, and 14KB using re-segmentation. The data segments 1-4 are segmented based on a segmentation algorithm. The fixed-size data segment 5 includes a data segment that is created using the re-segmentation algorithm. The re-segmentation includes calculating the fixed-size data segments to read that are stored in a repository. The exemplary diagrams depict writing that starts at index 4 and ends at index 5. The fixed-size data segments in the range [4,5] are read. The exemplary diagram illustrates performing re-segmentation on the fixed-size data segments with the new data. The segmentation is performed on 26KB and data segment 3 of size 18KB is obtained. The data segment 3 is stored in the fixed-size data segment 4, and a leftover of size 8KB is identified with no boundary. As the leftover is equal to the minimum data segment size, the leftover is stored in its own fixed-size data segment, that is at index 6.
FIG. 5B is an exemplary diagram that illustrates the random IO of 20 KB and another result of the re-segmentation. The exemplary diagram includes writing new data at offset 20KB, of size 32KB, which results in final data segments of sizes 20KB, 16KB, 20KB, and 20KB using re-segmentation. The fixed-size data segment 7 includes a data segment that is created using a re-segmentation algorithm. The re-segmentation includes calculating the fixed-size data segments to read that are stored in the repository. The exemplary writing starts at index 4 and ends at index 5. The fixed-size data segments in the range [4,5] are read. The exemplary diagram includes performing re-segmentation on the fixed-size data segments with the new data. The segmentation is performed on 26KB and data segment 3 of size 20KB is obtained. The data segment 3 is stored in the fixed-size data segment 4, and a leftover of size 6KB is identified with no boundary. As the leftover of 6KB is less than the minimum data segment size, the next data segment is read, that is data segment 7 with a size of 14KB. The leftover and the next data segment (6KB + 14KB) are appended into a single data segment of size 20KB and stored in data segment 7.
The re-segmentation algorithm enables writing variable size data segments on top of existing fixed-size deduplication while optimizing the random IOs using reserved space, thereby increasing the de-duplication ratio and the efficiency of a storage device.
FIG. 6 is a flow diagram that illustrates a method for managing the storage of data segments on a storage device in accordance with an implementation of the disclosure. At a step 602, one or more fixed-size data segments of a determined fixed size are stored on the storage device. At a step 604, an input data stream is received. At a step 606, the input data stream is segmented into one or more variable size data segments each having a variable size. At a step 608, offset data segments are created from the variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size. At a step 610, duplicate data is identified in the input data stream by comparing the offset data segments with the one or more fixed-size data segments.
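The steps 602 to 610 can be illustrated end to end with the following Python sketch, which reuses the segment_stream and pad_segment helpers sketched earlier; the SHA-256 fingerprint set is only an illustrative stand-in for however duplicates are actually detected.

import hashlib

def ingest(stream: bytes, seen: set) -> int:
    """Segment, pad to the fixed size, and count duplicate offset data segments."""
    duplicates = 0
    for segment in segment_stream(stream):         # step 606: variable size segmentation
        offset_segment = pad_segment(segment)      # step 608: size is a multiple of the fixed size
        digest = hashlib.sha256(offset_segment).digest()
        if digest in seen:                         # step 610: duplicate detection
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates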
The method improves an efficiency of the storage device by deduplicating the data segments without changing underlying data. The method writes the offset data segments and aligns them according to their variable size to obtain the one or more fixed-size data segments. Once the offset data segments are written, the method can detect duplicates of the data in the fixed-size data segments. The deduplication in the one or more fixed-size data segments is enhanced by writing the offset segments and not the actual segments. The offset segments enhance the deduplication ratio with minimal overhead. The method provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the one or more fixed-size data segments. The method provides the offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized. A deduplication method may be implemented in scenarios such as in-place or overwrite modifications.
Optionally, segmenting the input data stream uses as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream.
Optionally, if the determined fixed size is greater than a first predetermined value, the maximum size of each of the variable size segments is set to be smaller or equal to the determined fixed size.
Optionally, if the determined fixed size is smaller than a second predetermined value, the minimum size of each of the variable size segments is set to be greater or equal to the determined fixed size.
Optionally, the method further includes inserting one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
Optionally, the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments, depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments. Optionally, the method further includes storing the offset data segment into a repository.
Optionally, the method further includes storing, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream, and the size of the given variable size segment.
Optionally, creating offset data segments from the variable size data segments, includes padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
Optionally, a ratio between maximum segment size and minimum segment size is considered as k.
The following coding includes defining one or more fixed-size data segments by considering k = 4, where the variable segment size is greater than or equal to the fixed segment size.
1. MinSize = FSGrainSize (minimal segment size)
2. MaxSize = 4 * FSGrainSize
3. UserBlockSize = MaxSize
4. Segment = Variable-length segment
5. Segment.UserOffset = The user offset in which the variable-length segment starts
6. Segment.Size = The size (in bytes) of the variable-length segment. NOTE: MinSize <= Segment.Size <= MaxSize
7. FSGrainsFactor = 2 * k // 8 in our use-case
8. FSBlockSize = FSGrainsFactor * FSGrainSize
9. FSBlockIndex = Int(Segment.UserOffset / UserBlockSize)
10. FSBlockGrainIndex = [0, 1, 2, 3, ..., (FSGrainsFactor-1)]
11. FSBlockGrainIndex.Start = 2 * Int((Segment.UserOffset % UserBlockSize) / MinSize) + (Segment.UserOffset % FSGrainSize == 0 ? 0 : 1)
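The definitions above can be rendered in Python as follows; the concrete 8 KB grain size is an assumed value chosen only so that the sketch is runnable.

K = 4                                      # ratio between maximum and minimum segment size
FS_GRAIN_SIZE = 8 * 1024                   # assumed grain size for illustration
MIN_SIZE = FS_GRAIN_SIZE                   # minimal variable segment size
MAX_SIZE = K * FS_GRAIN_SIZE               # maximal variable segment size
USER_BLOCK_SIZE = MAX_SIZE
FS_GRAINS_FACTOR = 2 * K                   # 8 in this use-case
FS_BLOCK_SIZE = FS_GRAINS_FACTOR * FS_GRAIN_SIZE

def fs_block_index(user_offset: int) -> int:
    return user_offset // USER_BLOCK_SIZE

def fs_block_grain_index_start(user_offset: int) -> int:
    start = 2 * ((user_offset % USER_BLOCK_SIZE) // MIN_SIZE)
    return start + (0 if user_offset % FS_GRAIN_SIZE == 0 else 1)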
The following coding includes writing sequentially on the fixed size data segments.
1. Calculate NextSegment(UserOffset, Size)
2. Set FSBlockIndex = Int(Segment.UserOffset / UserBlockSize)
3. Set FSBlockGrainIndex.Start = 2 * Int((Segment.UserOffset % UserBlockSize) / FSGrainSize) + (Segment.UserOffset % FSGrainSize == 0 ? 0 : 1)
4. Store the new segment starting in FSGrain.Index = FSGrainsFactor * FSBlockIndex + FSBlockGrainIndex.Start
5. while true
6. Store the data [UserOffset, min(size, FSGrainSize)] in FSGrain.Index
7. size -= FSGrainSize
8. If (size <= 0) break
9. UserOffset += FSGrainSize
10. blockIndex = Int(UserOffset / UserBlockSize)
// If the segment crosses FSBlocks, jump to the 1st FSGrain (index=0) in the next FSSubBlock
11. If (blockIndex > FSBlockIndex)
12. FSGrain.Index = FSGrainsFactor * blockIndex
13. FSSubBlockIndex = blockIndex
14. else
// If the segment requires multiple FSGrains, use the consecutive FSGrains in the same FSSubBlock
15. FSGrain.Index++
16. end-if
17. end- while
18. DONE
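An informal Python sketch of the placement loop above is given below, continuing the assumed constants from the previous sketch; the store callback is a placeholder for the actual repository write, and the block index is carried across an FSBlock crossing so that consecutive grains are used within the new block.

def write_segment(user_offset: int, size: int, store) -> None:
    """Place one variable-length segment onto fixed-size grains, as in the steps above."""
    block_index = user_offset // USER_BLOCK_SIZE
    grain_index = FS_GRAINS_FACTOR * block_index + fs_block_grain_index_start(user_offset)
    while True:
        store(grain_index, user_offset, min(size, FS_GRAIN_SIZE))
        size -= FS_GRAIN_SIZE
        if size <= 0:
            break
        user_offset += FS_GRAIN_SIZE
        next_block = user_offset // USER_BLOCK_SIZE
        if next_block > block_index:
            # The segment crosses an FSBlock: jump to the first grain of the next block.
            grain_index = FS_GRAINS_FACTOR * next_block
            block_index = next_block
        else:
            # Otherwise keep using consecutive grains in the same block.
            grain_index += 1

# Example usage with a toy store callback that just prints the placement.
write_segment(0, 3 * FS_GRAIN_SIZE, lambda grain, offset, length: print(grain, offset, length))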
The following coding includes re-segmentation algorithm:
// Parameters: <offset, size> in user-view of the write chunk
1. Find the <FSBlockIndex, FSBlockGrainIndex> of the 1st FSGrain (FSGrain.Index.start) and the last FSGrain (FSGrain.Index.end)
2. Read all data in the range [FSGrain.Index.start, FSGrain.Index.end]
3. Run Segmentation on the data in the range [FSGrain.Index.start, FSGrain.Index.end]
// Let's denote the amount of "left over" data size after the segmentation as the remainder
4. If there is no remainder (remainder == 0) -> DONE
// Create a new segment with ONLY the data of the remainder
5. If remainder >= MinSize
6. Create a new segment for the remainder
7. Store the new segment in the appropriate location - <FSBlockIndex, FSBlockGrainIndex>
8. DONE
9. end-if
// remainder < MinSize
10. Read the next segment just after the FSGrain.Index.end (fully unchanged segment)
11. If (remainder + nextSegment.size) <= MaxSize
12. Add the remainder into the nextSegment
13. Store the modified segment in the appropriate location
14. // NOTE: the FSGrain location might be changed. Thus, we may need to delete the old segment
15. DONE
16. end-if
// (remainder + nextSegment.size) > MaxSize
// In this case, we are going to split the (remainder + nextSegment.size) into exactly 2 segments
17. Run segmentation on remainder+nextSegment with MaxSize = (remainder + nextSegment.size - MinSize) to find the 1st cut-point // NOTE: this special MaxSize setting ensures at least 2 segments
18. Create 2 new segments using the 1st cut-point
19. Store the 2 new segments in the appropriate locations
20. // NOTE: the locations might be different than the location of the old segment. Thus, we may need to delete the old segment
21. DONE
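The remainder handling of the re-segmentation algorithm above can be sketched in Python as follows; read_next_segment, store_segment and split_once are placeholders for the corresponding steps, and the default minimum and maximum sizes are the same assumed values as in the earlier sketches.

def handle_remainder(remainder: bytes, read_next_segment, store_segment, split_once,
                     min_size: int = 8 * 1024, max_size: int = 32 * 1024) -> None:
    """Decide what to do with the leftover data after re-running the segmentation."""
    if not remainder:
        return                                   # no remainder: nothing more to store
    if len(remainder) >= min_size:
        store_segment(remainder)                 # the remainder becomes its own segment
        return
    next_segment = read_next_segment()           # fully unchanged segment after the range
    merged = remainder + next_segment
    if len(merged) <= max_size:
        store_segment(merged)                    # fold the remainder into the next segment
    else:
        first, second = split_once(merged)       # otherwise split into exactly two segments
        store_segment(first)
        store_segment(second)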
FIG. 7 is an illustration of an exemplary computing arrangement 700 (e.g. a storage device) in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computing arrangement 700 includes at least one processor 704 that is connected to a bus 702, wherein the computing arrangement 700 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The computing arrangement 700 also includes a memory 706. Control logic (software) and data are stored in the memory 706, which may take the form of random-access memory (RAM). In the present description, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
The computing arrangement 700 may also include a secondary storage 710. The secondary storage 710 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, digital versatile disk (DVD) drive, recording device, universal serial bus (USB) flash memory. The removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 706 and the secondary storage 710. Such computer programs, when executed, enable the computing arrangement 700 to perform various functions as described in the foregoing. The memory 706, the secondary storage 710, and any other storage are possible examples of computer-readable media.
In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 704, a graphics processor coupled to a communication interface 712, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 704 and a graphics processor, or a chipset (i.e., a group of integrated circuits designed to work and be sold as a unit for performing related functions, etc.).
Furthermore, the architectures and functionalities depicted in the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system. For example, the computing arrangement 700 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, an embedded system. Furthermore, the computing arrangement 700 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, etc. Additionally, although not shown, the computing arrangement 700 may be coupled to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 708.
It should be understood that the arrangement of components illustrated in the figures described is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements illustrated in the described figures. In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that, when included in an execution environment, constitutes a machine, hardware, or a combination of software and hardware.
Although the disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims

1. A method for managing the storage of data segments on a storage device (100), the method comprising: storing a plurality of fixed-size data segments (102A-N, 204 A-N) of a determined fixed size on the storage device (100), receiving an input data stream (202), segmenting the input data stream (202) into a plurality of variable size data segments (206) each having a variable size, creating offset data segments (208A-B) from the variable size data segments (206) by offsetting the end of each variable size segment so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size, identifying duplicate data in the input data stream (202) by detecting said duplicate data in the offset data segments (208A-B).
2. A method according to claim 1, wherein segmenting the input data stream (202) uses as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream (202).
3. A method according to claim 2, wherein, if the determined fixed size is greater than a first predetermined value, the maximum size of each of the variable size segments is set to be smaller or equal to the determined fixed size.
4. A method according to any of claims 2 and 3, wherein if the determined fixed size is smaller than a second predetermined value, the minimum size of each of the variable size segments is set to be greater or equal to the determined fixed size.
5. A method according to any of claims 1 to 4, further comprising inserting one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
6. A method according to claim 5, wherein the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments (208A-B), depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments (206).
7. A method according to any of claims 1 to 6, further comprising storing the offset data segment into a repository.
8. A method according to any of claims 1 to 7, further comprising storing, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream (202) relative to a maximum size of each of the variable size data segments, and the size of the given variable size data segment.
9. A method according to any of claims 1 to 8, wherein creating offset data segments (208A-B) from the variable size data segments (206), comprises padding the end of each variable size segment with a padding character so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size.
10. A storage device (100) configured to: store a plurality of fixed-size data segments (102A-N, 204 A-N) of a determined fixed size, receive an input data stream (202), segment the input data stream (202) into a plurality of variable size data segments (206) each having a variable size, create offset data segments (208A-B) from the variable size data segments (206) by offsetting the end of each variable size segment so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size, identify duplicate data in the input data stream (202) by detecting said duplicate data in the offset data segments (208A-B).
11. A storage device (100) according to claim 10, further configured to segment the input data stream (202), using as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream (202).
12. A storage device (100) according to claim 11, further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value.
13. A storage device (100) according to any of claims 11 and 12, further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
14. A storage device (100) according to any of claims 10 to 13, further configured to insert one or more reserved segment of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
15. A storage device (100) according to claim 14, further configured so that the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments (208A-B), depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments (206).
16. A storage device (100) according to any of claims 10 to 15, further comprising a repository for storing the offset data segment.
17. A storage device (100) according to any of claims 10 to 16, further configured to store, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream (202) relative to a maximum size of each of the variable size data segments, and the size of the given variable size data segment.
18. A storage device (100) according to any of claims 10 to 17, further configured to create the offset data segments (208A-B) from the variable size data segments (206) by padding the end of each variable size segment with a padding character so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size.
PCT/EP2021/070951 2021-07-27 2021-07-27 Method for managing the storage of data segments on a storage device WO2023006183A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8234468B1 (en) * 2009-04-29 2012-07-31 Netapp, Inc. System and method for providing variable length deduplication on a fixed block file system
US20130073527A1 (en) * 2011-09-16 2013-03-21 Symantec Corporation Data storage dedeuplication systems and methods
US20130238570A1 (en) * 2012-03-08 2013-09-12 Dell Products L.P. Fixed size extents for variable size deduplication segments
US9465808B1 (en) * 2012-12-15 2016-10-11 Veritas Technologies Llc Deduplication featuring variable-size duplicate data detection and fixed-size data segment sharing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21749814

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE