WO2024051953A1 - Data storage system and method for segmenting data - Google Patents

Data storage system and method for segmenting data

Info

Publication number
WO2024051953A1
WO2024051953A1 (PCT/EP2022/075121)
Authority
WO
WIPO (PCT)
Prior art keywords: storage, data, segments, group, segment
Prior art date
Application number
PCT/EP2022/075121
Other languages
French (fr)
Inventor
Yair Toaff
Assaf Natanzon
Aviv Kuvent
Idan Zach
Michael Sternberg
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2022/075121
Publication of WO2024051953A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0685Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0647Migration mechanisms

Abstract

Provided is a method of segmenting data residing in a first storage to a second storage segmentation. The method includes dividing one or more segments in a first storage 104 into a plurality of groups. Each group is configured to be within a desired range of segment sizes in a second storage 106. The method includes creating a new hash value for each group based on existing hash values of segments within each group. The method includes checking the new hash values in a deduplication index of the second storage to identify groups that already exist. The method includes selecting the groups to move from the first storage to the second storage such that duplication of data is minimized.

Description

DATA STORAGE SYSTEM AND METHOD FOR SEGMENTING DATA
TECHNICAL FIELD
The present disclosure relates to storage systems and, more particularly, to a data storage system and a method for segmenting data.
BACKGROUND
Storage devices enable a computer to store data for any temporary or permanent purpose. Integrated hardware and software systems can capture, manage, secure, and prioritize the data. Disk storage in computer systems is reaching ever larger volumes and is no longer big enough to accommodate the growing needs.
A first step in increasing the usable volume is to add deduplication (dedup) functionality to the disk storage, which finds common areas in the data and stores existing data only once. In backup systems with a lot of repeated data, dedup enables a system to handle 20 times (or even more) the amount of user data. Data is also moved from primary storage, such as disks with dedup, to secondary storage, such as tape technologies, because of tape's cheaper price and long retention. To make the price even cheaper, dedup is also added to the tape. The amount of data in a tape-based system may be 100 times that of a disk dedup system, but such volumes can be handled efficiently because the segment parameters of tape-based systems are usually larger.
Existing solutions provide a process of segmenting the data, using hashes of the segments as fingerprints, and managing the index with techniques such as Bloom filters. In some of the existing solutions, data written to the secondary storage (also known as the cold tier) is segmented according to a cold tier segment size without optimization: the data has to be moved from the primary storage (also known as the hot tier) and inflated before re-segmentation, and additional CPU resources are required for segmenting and hashing the data a second time.
Therefore, the present disclosure aims to improve the segmentation of data over that of existing systems and technologies in data storage systems.
SUMMARY
It is an object of the present disclosure to provide a data storage system and a method for segmenting data while avoiding one or more disadvantages of prior art approaches.
This object is achieved by the features of the independent claims. Further implementations are apparent from the dependent claims, the description, and the figures.
The present disclosure provides segmentation of data in a data storage system.
According to a first aspect, there is provided a method of segmenting data residing in a first storage to a second storage segmentation comprising: dividing one or more segments in a first storage into a plurality of groups, wherein each group is configured to be within a desired range of segment sizes in a second storage; creating a new hash value for each group based on existing hash values of segments within each group; checking the new hash values in a deduplication index of the second storage to identify groups that already exist; and selecting the groups to move from the first storage to the second storage such that duplication of data is minimized.
The disclosed method is directed to a system that includes a server and a data storage system. The server performs all the calculations needed for segmentation and hashing and maintains the different data structures needed. The data storage system includes at least a hot tier of storage (SSDs) and a cold tier of tape-based storage, which may be a tape library or built-in tape drives. All values of segment size and resolution in sparse indices are configurable, where the cold tier segment size is larger than the hot tier segment size. For example, 16KB is an average segment size for the hot tier and 32KB is an average segment size for the cold tier. The method writes the data to the hot tier, from where it is later moved into the cold tier, or writes the data directly to the cold tier.
The method employs any suitable algorithm that compares the fingerprints of a buffer instead of comparing two or more buffers byte by byte. The method uses a data fingerprint such as SHA1, which provides a very low chance of a false positive: a 1PB storage with segments of ~16KB contains 2^36 segments, whereas the 20-byte SHA1 signature has 2^160 possible values, which gives a false-positive chance of about 1:2^124.
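As an illustration only (not part of the disclosure), the following minimal Python sketch shows the fingerprint-based comparison and reproduces the false-positive arithmetic above; the function names and the 1PB/16KB figures are taken from the example in the text.

```python
import hashlib
import math

def fingerprint(buf: bytes) -> bytes:
    """Return a 20-byte SHA1 fingerprint of a data buffer."""
    return hashlib.sha1(buf).digest()

def same_data(buf_a: bytes, buf_b: bytes) -> bool:
    """Compare two buffers via their fingerprints instead of byte by byte."""
    return fingerprint(buf_a) == fingerprint(buf_b)

# Rough false-positive estimate from the text: a 1PB store with ~16KB segments
# holds 2^50 / 2^14 = 2^36 segments; a 20-byte SHA1 digest has 2^160 possible
# values, so the chance of a random collision is on the order of 2^36 / 2^160 = 2^-124.
segments = 2**50 // 2**14                 # 2^36 segments
collision_chance = segments / 2**160      # ~2^-124
print(f"segments = 2^{int(math.log2(segments))}, "
      f"collision chance ~ 2^{int(math.log2(collision_chance))}")
```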
Preferably, the first storage includes one or more versions of the data to be migrated.
Preferably, the method further includes checking different combinations of segment groups against the data in the second storage to demarcate new and existing data.
Preferably, data is migrated directly to an empty second storage based on a segmentation parameter of the first storage such that a desired average segment size is attained for the second storage.
Preferably, data is migrated to a non-empty second storage by comparing hash values of various segment groups in the first storage to that of the existing segments in the second storage and grouping the segments based on a desired segment size of the second storage.
Preferably, the segments found within a predefined minimal and maximal segment size are grouped together.
Preferably, the segments that are not found are grouped in a group with a desired average segment size within the predefined minimal and maximal segment size of the second storage.
Preferably, the method includes creating the new hash value by creating a buffer whose size is defined by the maximum number of segments in a group multiplied by the size of the existing hash value, and copying the existing hash values of the segments in the group sequentially into the buffer to create a unique identifier for the group.
Preferably, the maximum number of segments in the group is calculated by dividing a maximal segment size of the second storage by a minimal segment size of the first storage.
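For example, assuming (purely for illustration, as the disclosure does not specify these values) a maximal segment size of 64KB in the second storage and a minimal segment size of 4KB in the first storage, the maximum number of segments in a group would be 64KB / 4KB = 16.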
Preferably, the method includes padding the buffer with zeroes to reach its size.
Preferably, the existing hash value is used again on the buffer.

According to a second aspect, there is provided a data storage system including: a random-access memory; a first storage; a second storage; a plurality of indices for storing values of segment sizes of the second storage; and a processor configured to: divide one or more segments in the first storage into a plurality of groups, wherein each group is configured to be within a desired range of segment size in the second storage; create a new hash value for each group based on existing hash values of segments within each group; check the new hash values in a deduplication index of the second storage to identify groups that already exist; and select the groups to move from the first storage to the second storage such that duplication of data is minimized.
The data storage system includes a server that performs all the calculations needed for segmentation and hashing and maintains the different data structures needed. The data storage system includes a large Random-Access Memory (RAM). The data storage system contains at least a hot tier of storage (SSDs) and a cold tier of tape-based storage, which may be a tape library or built-in tape drives. All values of segment size and resolution in sparse indices are configurable, where the cold tier segment size is larger than the hot tier segment size. For example, 16KB is an average segment size for the hot tier and 32KB is an average segment size for the cold tier. The data storage system writes the data to the hot tier, from where it is later moved into the cold tier, or writes the data directly to the cold tier.
Preferably, the first storage is a disk storage medium and the second storage is a tape storage medium.
These and other aspects of the present disclosure will be apparent from the implementations described below.
BRIEF DESCRIPTION OF DRAWINGS
Implementations of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a data storage system in accordance with an implementation of the present disclosure;
FIG. 2 is a block diagram of a data storage system for segmenting and storing data in accordance with an implementation of the present disclosure;
FIG. 3A is an exemplary diagram of one or more segments within each group in a data storage system in accordance with an implementation of the present disclosure;
FIG. 3B is an exemplary diagram of one or more segments for writing data using a data storage system in accordance with an implementation of the present disclosure; and FIG. 4 is a flow diagram that illustrates a method of segmenting and moving data residing in a first storage to a second storage in accordance with an implementation of the present disclosure.
DETAILED DESCRIPTION OF THE DRAWINGS
Implementations of the present disclosure provide a data storage system and a method for segmenting data.
To make solutions of the present disclosure more comprehensible for a person skilled in the art, the following implementations of the present disclosure are described with reference to the accompanying drawings.
Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the present disclosure are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the present disclosure described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
Definitions:
Data fingerprint: A data fingerprint refers to the output of a strong hash function, for example SHA1, which provides a 20-byte cryptographic hash, or later versions of the SHA family that provide a longer hash value. The data fingerprint is used in a dedup process. Any suitable algorithm is used to compare the fingerprints of a buffer instead of comparing two or more buffers byte by byte.
Segmentation: Segmentation is a process of dividing a large buffer of data into smaller parts, where the segments may be constant or variable in size.
Constant size segmentation: In constant size segmentation, the data is segmented into parts of the same size. This type of segmentation is rarely used in dedup systems, as an insertion or deletion of even a few bytes would shift and change all subsequent segments, and therefore dedup cannot happen.
Variable size segmentation: In variable size segmentation, the data is divided into variable size segments according to a window of data, where usually a rolling hash function is used on a rolling window of 64-256 bytes. When the hash value of the rolling hash satisfies a certain condition, the algorithm cuts the segment and starts a new one. The condition may be defined by a minimal size, average size, and maximum size of the segments.
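By way of illustration only, the sketch below shows one possible variable size segmentation using a simple Rabin-Karp style rolling hash; the window size, the min/average/max thresholds, and the hash parameters are assumptions for the example and are not mandated by the disclosure.

```python
from typing import List

WINDOW = 64                                                  # rolling window in bytes (assumed)
MIN_SEG, AVG_SEG, MAX_SEG = 8 * 1024, 16 * 1024, 32 * 1024   # assumed size limits
MASK = AVG_SEG - 1                                           # cut when low bits are zero -> ~AVG_SEG average
B, MOD = 31, 1 << 32                                         # polynomial rolling hash parameters
B_W = pow(B, WINDOW, MOD)                                    # factor for removing the oldest byte

def segment(data: bytes) -> List[bytes]:
    """Split data into variable-size segments with a rolling hash boundary condition."""
    segments, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * B + byte) % MOD                       # add the new byte to the window hash
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * B_W) % MOD     # drop the byte that left the window
        length = i - start + 1
        if (length >= MIN_SEG and (h & MASK) == 0) or length >= MAX_SEG:
            segments.append(data[start:i + 1])         # boundary found: cut the segment
            start, h = i + 1, 0
    if start < len(data):
        segments.append(data[start:])                  # trailing data forms the last segment
    return segments
```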
FIG. 1 is a block diagram of a data storage system 100 in accordance with an implementation of the present disclosure. The data storage system 100 includes a random-access memory 102, a first storage 104, a second storage 106, one or more indices 108A-N, and a processor 110. The one or more indices 108A-N store values of segment sizes of the second storage 106. The processor 110 is configured to divide one or more segments in the first storage 104 into one or more groups. Each group is configured to be within a desired range of segment size in the second storage 106. The processor 110 is configured to create a new hash value for each group based on existing hash values of segments within each group. The processor 110 is configured to check the new hash values in a deduplication index of the second storage 106 to identify groups that already exist. The processor 110 is configured to select the groups to move from the first storage 104 to the second storage 106 such that duplication of data is minimized.
The data storage system 100 includes a server that performs all the calculations needed for segmentation and hashing and maintains the different data structures needed. The data storage system 100 includes a Random-Access Memory (RAM). The data storage system 100 includes at least a hot tier of storage (SSDs) and a cold tier of storage such as tape-based technology, which may be a tape library or built-in tape drives. All values of segment size and resolution in sparse indices are configurable, where the cold tier segment size is larger than the hot tier segment size. For example, 16KB is an average segment size for the hot tier and 32KB is an average segment size for the cold tier. The data storage system 100 writes the data to the hot tier, from where it is later moved into the cold tier, or writes the data directly to the cold tier.
Any suitable algorithm is used to compare the fingerprints of a buffer instead of comparing two or more buffers byte by byte. The data storage system 100 uses a data fingerprint such as SHA1, which provides a very low chance of a false positive: a 1PB storage with segments of ~16KB contains 2^36 segments, whereas the 20-byte SHA1 signature has 2^160 possible values, which gives a false-positive chance of about 1:2^124.
Preferably, the first storage 104 is a disk storage medium and the second storage 106 is a tape storage medium.
FIG. 2 is a block diagram of a data storage system 200 for segmenting and storing data in accordance with an implementation of the present disclosure. The data storage system 200 includes a hot tier 202 with one or more SSDs, a cold tier 206 with one or more SSDs, a global sparse index 208, a sparse index 210, a full index 212, and one or more storage modules 214A-N. The data storage system 200 migrates data segmented according to a disk-based system to data segmented according to larger values, without inflating the data and repeating the whole segmentation and hashing process from zero. The data storage system 200 uses the segmentation according to the hot tier segment parameter as a base for calculating a representation for the cold tier 206, without inflating the data and recalculating the representation. The hot tier 202 (also known as a first storage) includes one or more segments that are divided into a plurality of groups. Each group is configured to be within a desired range of segment sizes in the cold tier 206 (also known as a second storage). The required cold tier segmentation sizes may be any of predefined minimum, maximum, or average sizes. The data storage system 200 may employ an algorithm that uses knowledge of the way the data is deduped in the hot tier 202 and tries one or more different divisions to find the best one among them. The data storage system 200 provides an easy way to calculate the hash values, as the algorithm calculates the hash on different divisions.

FIG. 3A is an exemplary diagram 300 of one or more segments within each group in a data storage system in accordance with an implementation of the present disclosure. The algorithm determines that there are several versions of data in a hot tier (also known as a first storage) and that old versions need to be moved to a cold tier (also known as a second storage). Preferably, the algorithm checks the required version of data that needs to be moved and the required version of data following the movement. The data storage system may mark segments as either unique or common, for each segment of the last version. The exemplary diagram 300 shows splitting of segments according to the minimal segment size, shown as V1, the maximal segment size, shown as V0, and segments that are marked as common, shown as V0'. Preferably, the segments that are marked as common form their own new segments according to the minimal segment size or maximal segment size. If segment J is not big enough, the data storage system enables segment D to join with segment J to reach the minimum size.
Preferably, when the cold tier is empty, the data storage system writes the data directly to the cold tier. The algorithm may segment the data according to the hot tier segmentation parameter and take every K segments together to reach an average segment size for the cold tier.
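A minimal sketch of this direct write path, using the 16KB and 32KB average segment sizes given as examples earlier; the helper name and the use of in-memory byte strings are illustrative assumptions.

```python
from typing import List

HOT_AVG_SEG = 16 * 1024                 # example hot tier average segment size from the text
COLD_AVG_SEG = 32 * 1024                # example cold tier average segment size from the text
K = COLD_AVG_SEG // HOT_AVG_SEG         # take every K hot tier segments together

def regroup_for_empty_cold_tier(hot_segments: List[bytes]) -> List[bytes]:
    """Concatenate every K consecutive hot tier segments into one cold tier segment."""
    return [b"".join(hot_segments[i:i + K])
            for i in range(0, len(hot_segments), K)]
```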
FIG. 3B is an exemplary diagram 302 of one or more segments for writing data using a data storage system in accordance with an implementation of the present disclosure. The data may be segmented according to the hot tier segmentation parameters if the data is written directly to a cold tier. The data storage system combines segments into larger segments by starting with the first segment, using an algorithm. The algorithm checks hashes for segment groups A, A-B, A-C, A-D, from the first group that meets the requirements (including a minimal size) up to the group that exceeds the maximal size, and the fingerprints of these groups are checked against the index of the cold tier. The algorithm forms a group if any of the hashes passes the requirements. The algorithm is then repeated with the segments after the group, until all segments are divided into larger segments.
Preferably, when segments are not found by the algorithm, those segments need not be grouped into found groups; their group size may vary, need not be within the desired average size, and may even fall outside the minimal and maximal segment sizes. The algorithm may cancel a found group when the found group is not within the minimal or maximal size. To optimize the process, the algorithm may query more fingerprints at the same time (A, A-B, ..., A-D, B, B-C, ..., E', E'-F, ..., E'-G) to store the data even if some of the segments are not used.
The exemplary diagram 302 shows the cold tier and new data. The algorithm starts with segment A, where segment A is not big enough. The algorithm may check the combinations A-B, A-C, A-D, A-E' and the like to find a match for segment A that satisfies the minimal size. Segments A-C may be suitable for the minimal size. The algorithm continues with segment D and checks D-E' and D-F, and may not find anything, as the other segments are not of a proper size. The algorithm continues with segment E' and checks E'-F and E'-G, and may not find anything, as the other segments are not of a proper size. The algorithm continues with segment F: since segment F is already big enough, it checks F, F-G, and F-H', and finds only F. The algorithm continues with segments G, H' and I', and may not find anything, as the other segments are not of a proper size. The result generated by the algorithm in the present example would be: found segments - A-C, F; new segments - D-E', G-I'.
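The walk-through above corresponds roughly to the greedy search sketched below. This is an illustrative sketch only: the cold tier index lookup, the size thresholds, and the group_hash helper (assumed to be the concatenated-fingerprint scheme described in the next paragraph) are assumptions, and the real algorithm may, as noted above, query many candidate groups at the same time.

```python
from typing import Callable, Dict, List, Tuple

MIN_SEG, MAX_SEG = 24 * 1024, 48 * 1024        # assumed cold tier minimal/maximal sizes

def group_segments(segments: List[Tuple[bytes, int]],          # (fingerprint, size) per hot segment
                   group_hash: Callable[[List[bytes]], bytes],
                   cold_index: Dict[bytes, object]):
    """Greedily group hot tier segments into cold-tier-sized groups, preferring
    groups whose combined hash already exists in the cold tier dedup index."""
    found, new, i = [], [], 0
    while i < len(segments):
        size, j, match = 0, i, None
        # grow the candidate group until adding the next segment would exceed the maximal size
        while j < len(segments) and size + segments[j][1] <= MAX_SEG:
            size += segments[j][1]
            j += 1
            if size >= MIN_SEG and group_hash([fp for fp, _ in segments[i:j]]) in cold_index:
                match = j                     # group already exists in the cold tier
                break
        if match is not None:
            found.append(segments[i:match])   # existing data: no need to move it again
            i = match
        else:
            end = j if j > i else i + 1       # no match: emit a new group (may be off-size)
            new.append(segments[i:end])
            i = end
    return found, new
```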
Preferably, any kind of hash function can be used for optimizations. Preferably, the hash function used is SHA1, where the algorithm creates a fingerprint from one or more SHA1 fingerprints by creating a buffer whose size is defined by the maximum number of segments in a group multiplied by the size of a SHA1 fingerprint. The maximum number of segments in a group is the cold tier maximal segment size divided by the hot tier minimal segment size. The algorithm copies the fingerprints of the segments in the group one after the other into the buffer and pads what is left with zeroes. Preferably, the buffer is a unique identifier of the group. The algorithm may apply SHA1 again on the buffer to make handling of the fingerprint easier and smaller.
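A minimal sketch of this group fingerprint construction, assuming 20-byte SHA1 fingerprints and illustrative maximal cold tier / minimal hot tier segment sizes (the disclosure does not fix these values); the optional final SHA1 over the buffer follows the last sentence above.

```python
import hashlib
from typing import List

SHA1_SIZE = 20                                          # size of a SHA1 fingerprint in bytes
COLD_MAX_SEG = 64 * 1024                                # assumed maximal cold tier segment size
HOT_MIN_SEG = 4 * 1024                                  # assumed minimal hot tier segment size
MAX_SEGMENTS_PER_GROUP = COLD_MAX_SEG // HOT_MIN_SEG    # e.g. 16 segments per group

def group_fingerprint(segment_fps: List[bytes], rehash: bool = True) -> bytes:
    """Build a unique identifier for a group from the existing SHA1 fingerprints of
    its segments: copy them sequentially into a fixed-size zero-filled buffer and
    optionally apply SHA1 again to keep the identifier small."""
    assert len(segment_fps) <= MAX_SEGMENTS_PER_GROUP
    buf = bytearray(MAX_SEGMENTS_PER_GROUP * SHA1_SIZE)      # zero padding comes for free
    for k, fp in enumerate(segment_fps):
        buf[k * SHA1_SIZE:(k + 1) * SHA1_SIZE] = fp          # copy fingerprints one after the other
    return hashlib.sha1(bytes(buf)).digest() if rehash else bytes(buf)
```

Such a function could, for example, be passed as the group_hash parameter in the grouping sketch above.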
FIG. 4 is a flow diagram 400 that illustrates a method of segmenting and moving data residing in a first storage to a second storage in accordance with an implementation of the present disclosure. At a step 402, the method includes dividing one or more segments in the first storage into one or more groups. Each group is configured to be within a desired range of segment sizes in the second storage. At a step 404, the method includes creating a new hash value for each group based on existing hash values of segments within each group. At a step 406, the method includes checking the new hash values in a deduplication index of the second storage to identify groups that already exist. At a step 408, the method includes selecting the groups to move from the first storage to the second storage such that duplication of data is minimized. The method employs any suitable algorithm that compares the fingerprints of a buffer instead of comparing two or more buffers byte by byte. The method uses a data fingerprint such as SHA1, which provides a very low chance of a false positive: a 1PB storage with segments of ~16KB contains 2^36 segments, whereas the 20-byte SHA1 signature has 2^160 possible values, which gives a false-positive chance of about 1:2^124.
It should be understood that the arrangement of components illustrated in the figures described is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements of the described figures. In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that, when included in an execution environment, constitutes a machine, hardware, or a combination of software and hardware.

Claims

1. A method of segmenting data residing in a first storage (104) to a second storage (106) segmentation comprising: dividing one or more segments in the first storage (104) into a plurality of groups, wherein each group is configured to be within a desired range of segment sizes in the second storage (106); creating a new hash value for each group based on existing hash values of segments within each group; checking the new hash values in a deduplication index of the second storage (106) to identify groups that already exist; and selecting the groups to move from the first storage (104) to the second storage (106) such that duplication of data is minimized.
2. The method of claim 1, wherein the first storage (104) comprises one or more versions of the data to be migrated.
3. The method of claim 2, further comprising checking different combinations of segment groups against the data in the second storage (106) to demarcate new and existing data.
4. The method of claim 1, wherein data is migrated directly to an empty second storage based on a segmentation parameter of the first storage (104) such that a desired average segment size is attained for the second storage (106).
5. The method of claim 1, wherein data is migrated to a non-empty second storage by comparing hash values of various segment groups in the first storage (104) to that of the existing segments in the second storage (106) and grouping the segments based on a desired segment size of the second storage (106).
6. The method of claim 5, wherein the segments found within a predefined minimal and maximal segment size are grouped together.
7. The method of claim 6, wherein the segments that are not found are grouped in a group with a desired average segment size within the predefined minimal and maximal segment size of the second storage (106).
8. The method of any preceding claim, wherein creating the new hash value comprises: creating a buffer in a size defined by maximum number of segments in a group multiplied by the size of the existing hash value; and copying the existing hash values of the segments in the group sequentially to create a unique identifier for the group.
9. The method of claim 8, wherein the maximum number of segments in the group is calculated by dividing a maximal segment size of the second storage (106) by a minimal segment size of the first storage (104).
10. The method of claim 8 or 9, further comprising padding the buffer with zeroes to reach its size.
11. The method of any one of claims 8 to 10, wherein the existing hash value is used again on the buffer.
12. A data storage system (100, 200) comprising: a random-access memory; a first storage (104); a second storage (106); a plurality of indices (108A-N) for storing values of segment sizes of the second storage (106); and a processor (110) configured to: divide one or more segments in the first storage (104) into a plurality of groups, wherein each group is configured to be within a desired range of segment size in the second storage (106); create a new hash value for each group based on existing hash values of segments within each group; check the new hash values in a deduplication index of the second storage (106) to identify groups that already exist; and select the groups to move from the first storage (104) to the second storage (106) such that duplication of data is minimized.
13. The data storage system (100, 200) of claim 12, wherein the first storage (104) is a disk storage medium and the second storage (106) is a tape storage medium.
PCT/EP2022/075121 2022-09-09 2022-09-09 Data storage system and method for segmenting data WO2024051953A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/075121 WO2024051953A1 (en) 2022-09-09 2022-09-09 Data storage system and method for segmenting data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/075121 WO2024051953A1 (en) 2022-09-09 2022-09-09 Data storage system and method for segmenting data

Publications (1)

Publication Number Publication Date
WO2024051953A1 true WO2024051953A1 (en) 2024-03-14

Family

ID=83692898

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/075121 WO2024051953A1 (en) 2022-09-09 2022-09-09 Data storage system and method for segmenting data

Country Status (1)

Country Link
WO (1) WO2024051953A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110099351A1 (en) * 2009-10-26 2011-04-28 Netapp, Inc. Use of Similarity Hash to Route Data for Improved Deduplication in a Storage Server Cluster
US20140358871A1 (en) * 2013-05-28 2014-12-04 International Business Machines Corporation Deduplication for a storage system
US20210389898A1 (en) * 2020-06-11 2021-12-16 Hitachi, Ltd. Storage device and data migration method

Similar Documents

Publication Publication Date Title
US10031675B1 (en) Method and system for tiering data
US10380073B2 (en) Use of solid state storage devices and the like in data deduplication
US8914338B1 (en) Out-of-core similarity matching
US9141633B1 (en) Special markers to optimize access control list (ACL) data for deduplication
US9514138B1 (en) Using read signature command in file system to backup data
US8639669B1 (en) Method and apparatus for determining optimal chunk sizes of a deduplicated storage system
US5732265A (en) Storage optimizing encoder and method
US8712963B1 (en) Method and apparatus for content-aware resizing of data chunks for replication
US10339112B1 (en) Restoring data in deduplicated storage
US10936228B2 (en) Providing data deduplication in a data storage system with parallelized computation of crypto-digests for blocks of host I/O data
US8959089B2 (en) Data processing apparatus and method of processing data
US9720928B2 (en) Removing overlapping ranges from a flat sorted data structure
US9367448B1 (en) Method and system for determining data integrity for garbage collection of data storage systems
US8627026B2 (en) Storage apparatus and additional data writing method
US9223660B2 (en) Storage device to backup content based on a deduplication system
US20090204636A1 (en) Multimodal object de-duplication
US20150039571A1 (en) Accelerated deduplication
US8836548B1 (en) Method and system for data compression at a storage system
WO2008127595A1 (en) Cluster storage using delta compression
US10838923B1 (en) Poor deduplication identification
US11314598B2 (en) Method for approximating similarity between objects
US10459648B1 (en) Change rate estimation
US20170344579A1 (en) Data deduplication
CN111522502B (en) Data deduplication method and device, electronic equipment and computer-readable storage medium
US9594643B2 (en) Handling restores in an incremental backup storage system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22789865

Country of ref document: EP

Kind code of ref document: A1