WO2015024160A1 - Data object processing method and apparatus - Google Patents

Data object processing method and apparatus

Info

Publication number
WO2015024160A1
WO2015024160A1 PCT/CN2013/081757 CN2013081757W
Authority
WO
WIPO (PCT)
Prior art keywords
data
length
block
data segment
compression rate
Prior art date
Application number
PCT/CN2013/081757
Other languages
English (en)
French (fr)
Inventor
魏建生
朱俊华
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to KR1020157021462A priority Critical patent/KR101653692B1/ko
Priority to RU2015139685A priority patent/RU2626334C2/ru
Priority to CA2898667A priority patent/CA2898667C/en
Priority to PCT/CN2013/081757 priority patent/WO2015024160A1/zh
Priority to EP13892074.9A priority patent/EP2940598B1/en
Priority to CN201380003213.3A priority patent/CN105051724B/zh
Priority to JP2015561907A priority patent/JP6110517B2/ja
Priority to BR112015023973-0A priority patent/BR112015023973B1/pt
Publication of WO2015024160A1 publication Critical patent/WO2015024160A1/zh
Priority to US14/801,421 priority patent/US10359939B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3059Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • H03M7/3064Segmenting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/068Hybrid storage device
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091Data deduplication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1041Resource optimization
    • G06F2212/1044Space efficiency improvement

Definitions

  • The present invention relates to the field of information technology, and in particular to a data object processing method and apparatus. Background Art
  • Data deduplication is the process of discovering and eliminating duplicate content in a data set or data stream to improve the storage and/or transmission efficiency of the data; it is also called duplicate data elimination, or deduplication for short.
  • Deduplication techniques typically split a data set or data stream into a series of data units and retain only one copy of the repeated data units, thereby reducing the space overhead in the data storage process or the bandwidth consumption during transmission.
  • How to divide a data object into data units in which duplicate content is easy to discover is a key problem to be solved. After a data object is divided into data units, a hash value h( ) of each data unit can be calculated as its fingerprint, and data units having the same fingerprint are defined as duplicate data.
  • Deduplication units commonly used in the prior art include files, fixed length blocks, content-based variable length chunks, and the like.
  • Content Defined Chunking (CDC) uses a sliding window to scan the data, identifies byte strings that match a preset feature, and marks the positions of those byte strings as chunk boundaries, thereby splitting the data set or data stream into a sequence of variable-length chunks.
  • Because this method selects chunk boundaries based on the content characteristics of the data, it can more sensitively find data units shared by similar files or data streams, and is therefore widely used in various data deduplication schemes.
  • Research shows that when a content-defined chunking method is used to split a data set or data stream, the smaller the chunk granularity, the higher the probability of finding duplicate data and the better the deduplication effect; however, a smaller chunk granularity also means that more chunks are obtained from a given data set, which increases the index overhead and the complexity of looking up duplicate data, and thus reduces the time efficiency of deduplication.
  • The expected length is the key parameter by which the Content Defined Chunking (CDC) method controls chunk granularity.
  • In general, the CDC method outputs a variable-length chunk sequence for a given data object; the length of each chunk statistically obeys a normal distribution, and the expected length is used to adjust the mean of that distribution.
  • The mean of the normal distribution appears as the average chunk length; since a normally distributed random variable takes its mean with the highest probability, the average chunk length is also called the peak length, and in the ideal case it equals the expected length.
  • To reduce the number of chunks while preserving the space efficiency of deduplication, the prior art proposes a content-based bimodal chunking method.
  • Its core idea is to use two variable-length chunking modes with different expected lengths.
  • When splitting a file into data blocks, the deduplication storage system is queried to determine whether candidate chunks are duplicates; a small-chunk mode is used in the transition regions between duplicate data and non-duplicate data, and a large-chunk mode is used outside the transition regions.
  • However, this technique cannot work independently: when deciding how to chunk a data object, the chunking device needs to frequently query the fingerprints of data blocks that already exist in the deduplication storage device.
  • The deduplication storage device stores the data blocks that remain after deduplication.
  • Based on the repetitiveness of the candidate chunks, the method determines whether a transition region between duplicate and non-duplicate data has been encountered, and then which chunking mode is finally adopted. This prior art therefore places a query load on the deduplication storage device. Summary of the Invention
  • Embodiments of the present invention provide a data object processing technique, which can split a data object into data blocks.
  • According to a first aspect, an embodiment of the present invention provides a data object processing method, including: dividing a data object into one or more partitions; calculating a sampling compression ratio of each partition, aggregating consecutive partitions whose sampling compression ratios share a common feature into one data segment, and obtaining the sampling compression ratio of each data segment; and selecting an expected length according to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the sampling compression ratio of each data segment belongs, and splitting the data segment into data blocks, where the sampling compression ratio of each data segment belongs to exactly one compression-ratio interval and the length of each data segment belongs to exactly one length interval.
  • According to a second aspect, an embodiment of the present invention provides a data object processing apparatus, including: a partition dividing module, configured to divide a data object into one or more partitions; a data segment generating module, configured to calculate a sampling compression ratio of each partition, aggregate consecutive partitions whose sampling compression ratios share a common feature into one data segment, and obtain the sampling compression ratio of each data segment; and a data block generating module, configured to select an expected length according to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the sampling compression ratio of each data segment belongs, and split the data segment into data blocks, where the sampling compression ratio of each data segment belongs to exactly one compression-ratio interval and the length of each data segment belongs to exactly one length interval.
  • By applying the embodiments of the present invention, the data object is first divided into partitions, the partitions are then aggregated into data segments according to the compression ratio of each partition, and the data segments are finally split into data blocks, thereby achieving the effect of splitting the data object into data blocks.
  • FIG. 1 is a flow chart of an embodiment of a data object processing method
  • FIG. 2 is a flow chart of an embodiment of a data object processing method
  • FIG. 3 is a flow chart of an embodiment of a data object processing method
  • FIG. 4 is a flow chart of an embodiment of a data object processing method
  • FIG. 5 is a flow chart of an embodiment of a data object processing method
  • Figure 6 is a structural diagram of an embodiment of a data object processing apparatus. Detailed Description
  • the prior art proposes to adopt fine-grained partitioning in the transition area between the repeated content and the non-repetitive content, and coarse-grained blocking in other areas, thereby forming a bimodal blocking method.
  • this method needs to frequently query the repeatability of candidate blocks in the blocking process, thereby causing query load pressure on the deduplication storage system.
  • the block result of this method depends on the transmission order of the multi-version data and is not stable.
  • The embodiments of the present invention provide a multi-peak deduplication chunking method based on content and compressibility, which can split a data object into data blocks.
  • Specifically, the data object is first divided into partitions, the partitions are then aggregated into data segments according to the sampling compression ratio of each partition, and an expected length is then selected according to the length and the sampling compression ratio of each data segment.
  • Splitting each data segment separately with its expected length achieves the effect of splitting the data object into a sequence of data blocks having a multi-peak length distribution.
  • a data object refers to a piece of data that can be operated, such as a file or a data stream.
  • The chunking policy mapping table maintains the mapping among compression-ratio intervals, length intervals, and expected chunk lengths: the higher the compression ratio of a data segment and the longer the segment, the larger the corresponding expected chunk length.
  • The method of the embodiments of the present invention can split a data object into data blocks, and the resulting data blocks can be used as the units of data deduplication, so that the data object can be stored or transmitted with reduced storage space occupation and without losing data.
  • the data blocks generated by the split can also be used for purposes other than data deduplication.
  • Embodiments of the present invention include: step (a), a data object (Data Object) is input to the chunking device; the data object, for example a file or a data stream, may come from outside the chunking device or from memory inside the chunking device, as long as it meets the requirements of the deduplication operation, which is not limited in this embodiment.
  • Step (b): the data object is divided into one or more partitions (Block), the compression ratio of each partition is estimated by sampling, the chunking policy mapping table is queried, and adjacent consecutive partitions belonging to the same compression-ratio interval are aggregated into one data segment (Segment); the sampling compression ratio is a measure of the compressibility of the data, and when the whole partition is sampled, the sampling compression ratio equals the compression ratio of the partition.
  • Step (c): for each data segment, the chunking policy mapping table is queried, an expected length is selected according to the segment's compression-ratio interval and length interval, and the content-defined chunking method divides the data segment into a sequence of chunks (Chunk) according to the selected expected length.
  • Step (d): adjacent chunks located in different data segments are spliced, and the hash value of each chunk is calculated for data deduplication; the splicing step is optional, and the hash values may also be calculated directly without splicing.
  • In step (b), the sampling compression ratio information of the partition sequence is used to re-divide the data object into a sequence of data segments, and this division can be done in various ways.
  • For example, in another implementation, step (b) may not refer to compression-ratio intervals at all, but instead aggregate adjacent consecutive partitions whose sampling compression ratios differ by less than a specified threshold into one data segment.
  • Step 11: Divide the data object to be deduplicated into one or more partitions, calculate the sampling compression ratio of each partition, aggregate adjacent consecutive partitions having a common compression-ratio feature into one data segment, and obtain the sampling compression ratio of each data segment.
  • Each partition may be of fixed length or of variable length. In the variable-length case, a random length within a certain range may be chosen as the partition length; alternatively, the data object may be scanned to output multiple sets of candidate chunk boundaries with different expected lengths, and one set of candidate chunk boundaries may be used to divide the data object into one or more variable-length partitions.
  • The compression ratio measures how much the data can be compressed and is computed as: compression ratio = compressed data size / original data size. One method of estimating it is: for each partition, extract a piece of sample data according to a sampling ratio S, compress the sample with a data compression algorithm (for example a lossless LZ algorithm or RLE coding), calculate the compression ratio of the sample, and use it as the compression ratio of the sampled partition. The higher the sampling ratio, the closer the sample's compression ratio is to the true compression ratio of the partition.
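  • As an illustration of the sampling estimate described above, the following minimal sketch (not taken from the patent) assumes zlib as the lossless compressor and takes the sample from the head of each partition; the sampling ratio and example inputs are illustrative.

```python
import os
import zlib

def sample_compression_ratio(partition: bytes, sampling_ratio: float = 0.1) -> float:
    """Estimate a partition's compression ratio from a small sample.

    compression ratio = compressed size / original size, so smaller means
    more compressible.  Sampling from the head of the partition is an
    illustrative choice, not one prescribed by the text.
    """
    sample_len = max(1, int(len(partition) * sampling_ratio))
    sample = partition[:sample_len]
    compressed = zlib.compress(sample)      # any lossless codec (LZ, RLE, ...) would do
    return len(compressed) / len(sample)

# A repetitive partition compresses well; random-looking data may exceed 1.0
# because of the codec's own metadata overhead.
print(sample_compression_ratio(b"abcd" * 4096))      # well below 1
print(sample_compression_ratio(os.urandom(16384)))   # close to (or above) 1
```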
  • For a compressible data segment, the compression ratio is generally less than 1; for an incompressible data segment, because metadata overhead such as description fields is added, the length of the compressed encoding may exceed the original data length, so the compression ratio can be greater than 1.
  • The compression-ratio table entries contain a series of compression-ratio intervals; the intersection of any two different intervals is empty, and the union of all intervals constitutes the complete compression-ratio range [0, ∞).
  • A compression-ratio feature means that the compression ratio is used as the parameter for aggregating partitions: partitions whose compression ratio, or a value computed from the compression ratio, satisfies a preset condition are the partitions that conform to the compression-ratio feature.
  • Specifically, the compression-ratio feature may be a compression-ratio interval, that is, a range of compression ratios; it may also be a threshold on the difference between the compression ratios of adjacent partitions. In the example of Table 1, a compression-ratio interval is used as the compression-ratio feature.
  • The data object is divided into partitions 1 through 7, seven partitions in total. Each partition is sampled and its compression ratio estimated, yielding the sampling compression ratio of each partition.
  • For example, the sampling compression ratio of partition 1 is 0.4 and that of partition 2 is 0.42.
  • Each sampling compression ratio belongs to one compression-ratio interval; for example, the ratio 0.4 belongs to the interval [0, 0.5) and the ratio 0.61 belongs to the interval [0.5, 0.8).
  • Because each partition has one sampling compression ratio, each partition can be regarded as belonging to one compression-ratio interval; adjacent partitions belonging to the same compression-ratio interval are aggregated into one data segment, so partition 1 and partition 2 can be aggregated into data segment 1, partitions 3, 4 and 5 into data segment 2, and partitions 6 and 7 into data segment 3.
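  • The aggregation rule just described can be sketched as follows; the interval boundaries and the ratios 0.4 and 0.42 come from the example above, while the remaining partition ratios are hypothetical values chosen to reproduce the three segments of Table 1.

```python
import bisect

# Compression-ratio intervals [0, 0.5), [0.5, 0.8), [0.8, inf), as in the example.
INTERVAL_EDGES = [0.5, 0.8]

def interval_index(ratio: float) -> int:
    """Index of the compression-ratio interval that a sampling ratio falls into."""
    return bisect.bisect_right(INTERVAL_EDGES, ratio)

def aggregate_by_interval(partition_ratios):
    """Group adjacent partitions whose sampling compression ratios share an interval."""
    segments, current = [], [0]
    for i in range(1, len(partition_ratios)):
        if interval_index(partition_ratios[i]) == interval_index(partition_ratios[i - 1]):
            current.append(i)
        else:
            segments.append(current)
            current = [i]
    segments.append(current)
    return segments

# Partitions 1-7: 0.40 and 0.42 are quoted above, the remaining ratios are hypothetical.
ratios = [0.40, 0.42, 0.61, 0.55, 0.70, 0.85, 0.90]
print(aggregate_by_interval(ratios))   # [[0, 1], [2, 3, 4], [5, 6]] -> three data segments
```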
  • Step 12 According to the length interval to which the length of each data segment belongs, the compression ratio interval to which the sampling compression ratio of each data segment belongs, select a desired length to split the data segment into data blocks.
  • The range of possible data segment lengths is divided into at least one length interval; the length of each data segment belongs to exactly one length interval, and the sampling compression ratio of each data segment belongs to exactly one compression-ratio interval.
  • The sampling compression ratio of a data segment can be obtained by sampling and compressing the data segment itself, or by averaging the sampling compression ratios of the partitions that make up the segment.
  • The average may be an arithmetic mean, or a weighted average of the partitions' sampling compression ratios computed according to the partition lengths: the sampling compression ratio of each constituent partition is used as the value and the length of each partition as its weight, and the weighted average of the partitions' sampling compression ratios is calculated.
  • If the partitioning in step 11 computed multiple sets of candidate chunk boundaries with different expected lengths and used one set of them to divide the data object into variable-length partitions, then this step may, according to the selected expected length, pick from those candidate chunk boundaries the boundaries having the same expected length to split the data segment into data blocks; that is, the corresponding boundaries can simply be selected from the sets of candidates obtained by the scan in step 11, without scanning the data segment again.
  • If another partitioning method was used in step 11, for example fixed-length partitioning, then this step may select an expected length and then scan the data segment to find the chunk boundaries, thereby splitting the data segment into data blocks; compared with the method of pre-computing multiple sets of candidate chunk boundaries with different expected lengths, this adds one pass of scanning the data segments to find the chunk boundaries.
  • Referring to the example of Table 2, the range of compression ratios is divided into three intervals [0, 0.5), [0.5, 0.8) and [0.8, ∞); the intervals do not intersect, and the sampling compression ratio of each data segment belongs to one compression-ratio interval, which can also be understood as each data segment corresponding to one compression-ratio interval.
  • In addition, the range of data segment lengths is divided into at least one length interval, and the length intervals do not intersect, so each data segment corresponds to one length interval; the length interval and the compression-ratio interval together determine the expected length of the data segment.
  • For example, the length of data segment A belongs to the interval [0 MB, 10 MB) and its sampling compression ratio lies in [0, 0.5); the expected length jointly determined by the length interval [0 MB, 10 MB) and the compression-ratio interval [0, 0.5) is 32 KB, so the expected length of data segment A is 32 KB. In the same way, the length interval of data segment B and the compression-ratio interval of its sampling compression ratio give data segment B an expected length of 256 KB. Once the expected lengths are obtained, each data segment can be split into data blocks according to its expected length (here B is the abbreviation of byte, 1 KB = 1024 bytes, 1 MB = 1024 KB).
  • Because the expected length is a theoretical value, the actual length of each data block may differ, that is, the blocks are of variable length; however, among the data blocks formed by splitting data segment A, the average block length is close to the expected length of 32 KB, a block length of 32 KB occurs with the highest probability, and 32 KB is also a peak length.
  • In the embodiments of the present invention, each expected length corresponds to one peak length; the embodiments therefore constitute a data chunking and deduplication method with multiple peak lengths.
  • the desired length can be obtained using empirical values or by analytical statistics.
  • An optional rule for determining the expected length is shown in the example of Table 2: for the same compression-ratio interval, the corresponding expected length increases as the lower bound of the length interval increases; conversely, for the same length interval, the expected length recorded in the mapping table increases as the lower bound of the compression-ratio interval increases.
  • In other words, the expected chunk length of a data segment is positively correlated with the lower bound of the compression-ratio interval to which the segment's sampling compression ratio belongs and with the lower bound of the length interval to which the segment's length belongs.
  • In some cases the deduplication ratio of large data objects and of hard-to-compress data objects is not sensitive to chunk granularity; with this way of choosing the expected length, the number of chunks can be reduced quickly without causing the deduplication ratio to deteriorate rapidly.
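  • A sketch of the expected-length lookup follows. Only the 32 KB entry for the combination ([0 MB, 10 MB), [0, 0.5)) is stated in the text; the other table entries and the length-interval boundaries are hypothetical, but they respect the monotonic rule described above.

```python
import bisect

KB, MB = 1024, 1024 * 1024

# Hypothetical chunking-policy mapping table.  Only the 32 KB entry for
# ([0 MB, 10 MB), [0, 0.5)) is stated in the text; every other value is
# illustrative, but rows and columns follow the stated monotonic rule.
LENGTH_EDGES = [10 * MB, 100 * MB]     # [0,10MB), [10MB,100MB), [100MB,inf)
RATIO_EDGES = [0.5, 0.8]               # [0,0.5), [0.5,0.8), [0.8,inf)
EXPECTED_LENGTH = [                    # rows: length interval, columns: ratio interval
    [32 * KB,  64 * KB, 128 * KB],
    [64 * KB, 128 * KB, 256 * KB],
    [128 * KB, 256 * KB, 512 * KB],
]

def expected_chunk_length(segment_length: int, segment_ratio: float) -> int:
    """Look up the expected chunk length for one data segment."""
    row = bisect.bisect_right(LENGTH_EDGES, segment_length)
    col = bisect.bisect_right(RATIO_EDGES, segment_ratio)
    return EXPECTED_LENGTH[row][col]

# Data segment A from the text: shorter than 10 MB, ratio in [0, 0.5) -> 32 KB.
print(expected_chunk_length(8 * MB, 0.4))    # 32768
```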
  • According to the position of each data block in the data segment, the data blocks formed by splitting one data segment have a sequential order; these ordered data blocks form a data block sequence, which may also be called a chunk subsequence.
  • Step 12 may further include splicing together the chunk subsequences generated by splitting the different data segments to form the chunk sequence of the data object, in which the order of the chunks is the same as the order of the chunked data within the data object.
  • Step 13: in each pair of adjacent data segments, splice the last data block of the preceding data segment and the first data block of the following data segment into one data block; the data block formed in this way may be called a spliced data block.
  • Step 13 is optional. For example, if the data segment boundaries in step 11 come from the boundaries of fixed-length partitions, step 13 may be used; after splicing, the fixed-length partition boundaries are prevented from being inherited as chunk boundaries, which yields a better deduplication effect.
  • The neighbouring data blocks on the two sides of a spliced data block, produced in the transition region between two data segments, correspond to different expected lengths.
  • Optionally, the smaller of the two expected lengths may be used to split the spliced data block into finer-grained data blocks, which makes it easier to find duplicate content in the transition region between the two data segments and improves the deduplication effect while adding few extra blocks.
  • In other embodiments, an expected length even smaller than the smaller of the two expected lengths may also be used to split the spliced data block into finer-grained data blocks.
  • step 13 may be performed prior to splicing the block subsequence or after splicing the block subsequence.
  • Step 14: Calculate the fingerprint of each data block, use the fingerprint to determine whether the data block is already stored in the storage device, and send the data blocks not yet stored in the storage device, together with the fingerprints and sampling compression ratios corresponding to those unstored data blocks, to the storage device.
  • The fingerprint uniquely identifies a data block; there is a one-to-one correspondence between a data block and its fingerprint.
  • One fingerprinting method is to compute a hash value of the data block as its fingerprint.
  • Step 15: The storage device stores the received data and the fingerprints of the data blocks. When storing, it may check whether the sampling compression ratio of a received data block meets a compression-ratio threshold, compress and store the blocks that meet the threshold to save storage space, and store the blocks that do not meet the threshold directly without compression. For example, if the threshold is "at most 0.7", data blocks with a compression ratio of 0.7 or less may be compressed before storage, while data with a compression ratio greater than 0.7 is stored directly without compression.
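  • Steps 14 and 15 can be illustrated with the following sketch, which assumes SHA-256 as the fingerprint hash, an in-memory dictionary standing in for the storage device's index, and the 0.7 compression-ratio threshold from the example; none of these choices are mandated by the text.

```python
import hashlib
import zlib

store = {}   # stands in for the deduplication storage device's index: fingerprint -> data

def fingerprint(chunk: bytes) -> str:
    """Step 14: a hash of the chunk uniquely identifies it."""
    return hashlib.sha256(chunk).hexdigest()

def send_and_store(chunk: bytes, sampling_ratio: float, threshold: float = 0.7) -> None:
    """Steps 14/15: only unstored chunks are sent; compressible ones are stored compressed."""
    fp = fingerprint(chunk)
    if fp in store:                    # duplicate: nothing new needs to be stored
        return
    if sampling_ratio <= threshold:    # meets the threshold -> compress before storing
        store[fp] = ("compressed", zlib.compress(chunk))
    else:                              # hard to compress -> store directly
        store[fp] = ("raw", chunk)

send_and_store(b"hello " * 1000, sampling_ratio=0.05)
send_and_store(b"hello " * 1000, sampling_ratio=0.05)   # deduplicated, not stored again
print(len(store))    # 1
```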
  • It should be noted that there is more than one way to form the segments in step 11.
  • For example, in another implementation, the segmentation strategy of step 11 can be modified to: aggregate adjacent consecutive partitions whose sampling compression ratios differ by less than a specified threshold into the same data segment; that is, a sampling-compression-ratio difference smaller than the specified threshold is the compression-ratio feature. Taking Table 1 as an example again, suppose the threshold is a sampling-compression-ratio difference of less than 0.1. Then 0.42 - 0.4 < 0.1, so partition 1 and partition 2 are placed in the same data segment, whereas 0.53 - 0.42 > 0.1, so partition 3 is not placed in the same data segment as partition 2.
  • partition 1 and partition 2 are divided into the first data segment
  • partition 3 and partition 4 are divided into the second data segment
  • partition 5, partition 6, and partition 7 are divided into the third data segment.
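  • A sketch of this alternative grouping rule is shown below; it differs from the earlier interval-based sketch only in the grouping predicate. The ratios 0.40, 0.42 and 0.53 are quoted in the worked example, and the later values are hypothetical ones chosen so that the result matches the three segments described above.

```python
def aggregate_by_threshold(ratios, threshold=0.1):
    """Start a new data segment whenever the ratio difference between neighbouring
    partitions is not smaller than the threshold."""
    segments, current = [], [0]
    for i in range(1, len(ratios)):
        if abs(ratios[i] - ratios[i - 1]) < threshold:
            current.append(i)
        else:
            segments.append(current)
            current = [i]
    segments.append(current)
    return segments

# 0.40, 0.42 and 0.53 are quoted above; the later values are hypothetical.
print(aggregate_by_threshold([0.40, 0.42, 0.53, 0.60, 0.72, 0.78, 0.85]))
# [[0, 1], [2, 3], [4, 5, 6]] -> the three data segments of the worked example
```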
  • the second embodiment is described in detail in FIG. 2.
  • the data object processing method in this embodiment includes the following steps.
  • Step 21: load the chunking policy mapping table. Referring to Table 1 and Table 2, the table records which compression ratios belong to the same compression-ratio feature; it also records the length interval to which a data segment belongs, the compression-ratio interval to which the segment's compression ratio belongs, and the expected length jointly determined by the two. The expected length is used to split the data segment into data blocks. This step may also be performed later, as long as the table is loaded before it needs to be used.
  • the data object to be processed is obtained, and the source of the acquisition may be the device that executes the method, or the external device connected to the device that performs the method.
  • this step needs to scan the data object to obtain the boundary of each fixed length partition, and the data between the adjacent two boundaries is a partition.
  • the consecutive partitions having the same compression rate feature are aggregated into data segments.
  • each data segment is divided into data blocks having a specific desired length.
  • this step needs to scan each data segment, find the boundary of each data block, and split the data segment with the boundary to form a series of data blocks.
  • In this embodiment, once step 26 has been performed, the processing of the data object is complete and the effect of splitting the data object into data blocks has been achieved; the subsequent steps 27, 28 and 29 are further extensions of the data object processing method.
  • One splicing method in this step is: for any two adjacent data segments, splice the last data block of the preceding data segment and the first data block of the following data segment into one data block, thereby eliminating the boundary between the data segments, and merge the multiple subsequences to form the data block sequence of the entire data object.
  • the data block spliced in this step can be split into smaller-sized data blocks to improve the effect of deduplication.
  • Another splicing method is that every data block remains unchanged and the chunk subsequences of the data segments are ordered according to the order of the data segments within the data object, with the first data block of the next data segment following the last data block of the previous data segment, so as to form the data block sequence of the entire data object.
  • the block compression ratio of the block can be obtained by using the sample compression ratio of the segment to which it belongs.
  • One method is to directly use the segmented sample compression rate as the sample compression ratio of each block contained therein.
  • One specific implementation of the above steps is: scan the data object with a w-byte sliding window, and each time the window slides forward by one byte, recompute the fingerprint f(w-bytes) of the data in the window with a fast hash algorithm; if the expected length of the current data segment is E, then whether the expression Match(f(w-bytes), E) = D holds is used to judge whether the current fingerprint satisfies the chunk-boundary filter condition, where the integer D ∈ [0, E) is a predefined feature value.
  • The function Match( ) maps f(w-bytes) into the interval [0, E); because the hash function f( ) is random, this filter condition outputs a series of chunks with peak length E.
  • When the flow reaches a new data segment, the expected length E changes, so a chunk subsequence with a new peak length is output.
  • When chunking the region adjoining two data segments, the leading boundary of a data block is determined by the expected length of the preceding data segment and its trailing boundary by the expected length of the following data segment, so the two boundaries of that data block lie in the two data segments respectively; this is equivalent to automatically completing the splicing of the chunk subsequences of adjacent segments, and the flow therefore does not introduce segment boundaries as chunk boundaries.
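  • The sliding-window procedure can be sketched as follows, assuming a simple polynomial rolling hash as the "fast hash algorithm", D = 0 and a 48-byte window; these choices, and the absence of minimum/maximum chunk-length safeguards, are simplifications rather than requirements of the text.

```python
import os
import statistics

def split_segment(data: bytes, expected_len: int, window: int = 48, d: int = 0):
    """Content-defined chunking of one data segment.

    A w-byte window slides over the segment; its fingerprint f is maintained
    with a polynomial rolling hash (standing in for the "fast hash algorithm"),
    and a chunk boundary is declared wherever Match(f, E) = f mod E equals D.
    """
    BASE, MOD = 257, (1 << 61) - 1
    pow_w = pow(BASE, window - 1, MOD)
    f, boundaries = 0, []
    for i, byte in enumerate(data):
        if i >= window:                                  # drop the byte leaving the window
            f = (f - data[i - window] * pow_w) % MOD
        f = (f * BASE + byte) % MOD
        if i + 1 >= window and f % expected_len == d:    # boundary filter Match(f, E) = D
            boundaries.append(i + 1)
    chunks, start = [], 0
    for b in boundaries:
        chunks.append(data[start:b])
        start = b
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# On random data the average chunk length comes out near the expected length.
chunks = split_segment(os.urandom(1 << 22), expected_len=4096)
print(round(statistics.mean(len(c) for c in chunks)))    # on the order of 4096
```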
  • For ease of understanding, Embodiment 2 additionally provides flowchart 3, whose steps are, in order: 3a, input the data object on which the data object processing method is to be performed; the data object may be temporarily held in a cache before being processed; 3b, partition the data object, sample it and estimate the compression ratios, where L, M and H denote three different compression-ratio intervals; 3c, aggregate consecutive partitions having a common compression-ratio feature into data segments; 3d, select an expected length according to the compression ratio and length characteristics of each data segment and compute the chunk boundaries, split each data segment into data blocks at the computed boundaries, the data blocks of each data segment forming a chunk subsequence; 3e, splice the chunk subsequences of the data segments and compute the chunk fingerprints.
  • Embodiment 2 divides the data object to be deduplicated into a number of fixed-length partitions.
  • In Embodiment 3, by contrast, the data object to be deduplicated is pre-chunked using expected lengths: a content-defined chunking method scans the data object (file/data stream) and generates multiple sets of candidate chunk boundaries with different expected lengths, where each set of candidate boundaries corresponds to one candidate chunk sequence of the data object (a candidate chunk sequence can also be understood as a chunking scheme).
  • One of these pre-chunking schemes is used to divide the data object into partitions, which are then used for sampling compression and for determining how the data segments are formed; in the later step of splitting the data segments into data blocks, the candidate chunk boundaries corresponding to the selected expected length can be chosen to split out the data blocks.
  • Embodiment 2 therefore has to scan the data object both when dividing it into partitions and when splitting the data segments into data blocks, whereas this embodiment only needs to scan the data object once, which saves system resources and improves the processing efficiency.
  • Moreover, because both the partition boundaries and the data block boundaries in Embodiment 3 are taken from the candidate chunk boundaries, the segment boundaries produced by aggregating partitions do not adversely affect deduplication, and no operation for removing segment boundaries from the data blocks is needed; that is, it is not necessary to splice the last data block of the preceding data segment and the first data block of the following data segment in adjacent data segments into one data block.
  • the data object processing method in this embodiment may specifically include the following steps.
  • Step 41: load the chunking policy mapping table. Referring to Table 1 and Table 2, the table records which compression ratios belong to the same compression-ratio feature; it also records the length interval to which a data segment belongs, the compression-ratio interval to which the segment's compression ratio belongs, and the expected length jointly determined by the two, which is used to split the data segment into data blocks.
  • the data object to be processed is obtained, and the source of the acquisition may be the device that executes the method, or the external device connected to the device that performs the method.
  • Step 43: in one scanning pass, output multiple sets of candidate chunk boundaries, each set corresponding to one expected length.
  • These expected lengths include the expected length of every data segment used in the subsequent steps.
  • Step 44: from the multiple sets of candidate chunk boundaries of step 43, select one set of candidate boundaries, and sample-compress the candidate chunks formed by that set of boundaries.
  • This step in effect partitions the data object, the expected length of the partitions being one of the expected lengths of step 43; when partitioning according to this expected length, the data object does not need to be scanned again, and the partitions can be obtained directly from the candidate chunk boundaries corresponding to that expected length. The data of each partition is then sample-compressed, and the compression ratio obtained from the sample is used as the sampling compression ratio of the whole partition.
  • the consecutive partitions with the same compression rate feature are aggregated into data segments.
  • Step 46: select for each data segment the candidate chunk boundaries corresponding to its expected length; the boundaries of a data block determine its length and position, so this step splits the data segments into data blocks.
  • As noted above, because the candidate chunk boundaries of step 46 come from step 43, the data object does not need to be scanned again when splitting the data segments into data blocks, which saves system resources.
  • In this embodiment, once step 46 has been performed, the processing of the data object is complete and the effect of splitting the data object into data blocks has been achieved.
  • Subsequent steps 47, 48, 49 are further extensions to the data object processing method.
  • Step 47: splice the chunk subsequences of adjacent data segments: the subsequences of the data segments are ordered according to the order of the data segments in the data object to form the data chunk sequence of the data object (a data chunk may also be called a data block).
  • the block compression ratio of the block can be obtained by using the sample compression ratio of the segment to which it belongs.
  • One method is to directly use the segmented sample compression rate as the sample compression ratio of each block contained therein.
  • For ease of understanding, Embodiment 3 additionally provides flowchart 5.
  • The steps in FIG. 5 are, in order: 5a, input the data object on which the data object processing method is to be performed; the data object may be temporarily held in a cache before being processed; 5b, determine multiple sets of candidate chunk boundaries with different expected lengths; 5c, among the sets of candidate chunk boundaries determined in 5b, select one candidate chunk sequence to divide the data object, and sample the resulting chunks and estimate their compression ratios; 5d, aggregate consecutive candidate chunks having a common compression-ratio feature into data segments; 5e, select an expected length and the corresponding chunk boundaries according to the compression ratio and length characteristics of each data segment, the data blocks into which each data segment is split at those boundaries forming a chunk subsequence; 5f, splice the chunk subsequences of the data segments and compute the chunk fingerprints.
  • the embodiment describes a data object processing apparatus 6, and the method of the first embodiment, the second embodiment, and the third embodiment can be applied.
  • the data object processing apparatus 6 includes: a partition division module 61, a data segment generation module 62, and a data block generation module 63.
  • In the data object processing apparatus 6, the partition dividing module 61 is configured to divide a data object into one or more partitions; the data segment generating module 62 is configured to calculate the sampling compression ratio of each partition, aggregate consecutive partitions whose sampling compression ratios share a common feature into one data segment, and obtain the sampling compression ratio of each data segment.
  • The data block generating module 63 selects, according to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the compression ratio of each data segment belongs, an expected length with which to split the data segment into data blocks, where the compression ratio of each data segment belongs to exactly one compression-ratio interval and the length of each data segment belongs to exactly one length interval.
  • Specifically, the partition dividing module 61 may divide the data object into one or more fixed-length partitions; it may also be configured to compute multiple sets of candidate chunk boundaries with different expected lengths and use one set of candidate chunk boundaries to divide the data object into one or more variable-length partitions.
  • The data segment generating module 62 may specifically be configured to: calculate the sampling compression ratio of each partition, aggregate adjacent consecutive partitions whose sampling compression ratios belong to the same compression-ratio interval into one data segment, and obtain the sampling compression ratio of each data segment.
  • The data segment generating module 62 may also be configured to: calculate the sampling compression ratio of each partition, aggregate adjacent consecutive partitions whose sampling compression ratios differ by less than a specified threshold into one data segment, and obtain the sampling compression ratio of each data segment.
  • In the various possible implementations of the data object processing apparatus 6, the function of the data block generating module 63 of selecting an expected length and splitting a data segment into data blocks may specifically be: according to the selected expected length, select, from the multiple sets of candidate chunk boundaries with different expected lengths computed by the partition dividing module, the chunk boundaries having the same expected length to split the data segment into data blocks.
  • The data block generating module 63 may also be configured to splice, in adjacent data segments, the last data block of the preceding data segment and the first data block of the following data segment into one spliced data block.
  • Further, the data block generating module 63 may also be configured to split the spliced data block into multiple data blocks, the expected length used for the splitting being less than or equal to the expected length corresponding to the preceding data segment and less than or equal to the expected length corresponding to the following data segment.
  • the object processing device 6 may further include a data block transmitting module 64.
  • The data block sending module 64 is configured to calculate the fingerprint of each data block, use the fingerprints to determine whether each data block is already stored in the storage device, and send the data blocks not stored in the storage device, together with the fingerprints and sampling compression ratios corresponding to those unstored data blocks, to the storage device; the storage device stores the received data block fingerprints, determines whether the sampling compression ratio of each received data block meets the compression-ratio threshold, and stores the data blocks that meet the threshold after compressing them.
  • the object processing device 6 can also be regarded as a device composed of a CPU and a memory.
  • the program is stored in the memory, and the CPU executes the methods of the first embodiment, the second embodiment or the third embodiment through the program in the memory. It can also include an interface for connecting to a storage device. The function of the interface can, for example, send data blocks generated after processing by the CPU to the storage device.
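  • To make the division of labour among modules 61-63 concrete, the following sketch composes the helper functions from the earlier sketches (sample_compression_ratio, aggregate_by_interval, expected_chunk_length, split_segment, assumed to be defined in the same module) into one illustrative class; the class and method names are assumptions, not an interface defined by the patent.

```python
class DataObjectProcessor:
    """Illustrative composition of modules 61-63, reusing the helper functions
    sketched earlier (assumed to be available in the same module)."""

    def __init__(self, partition_len: int = 4 * 1024 * 1024):
        self.partition_len = partition_len           # fixed-length partitioning (module 61)

    def divide_into_partitions(self, data: bytes):   # partition dividing module 61
        n = self.partition_len
        return [data[i:i + n] for i in range(0, len(data), n)]

    def generate_segments(self, partitions):         # data segment generating module 62
        if not partitions:
            return []
        ratios = [sample_compression_ratio(p) for p in partitions]
        segments = []
        for group in aggregate_by_interval(ratios):
            seg = b"".join(partitions[i] for i in group)
            # length-weighted average of the partitions' sampling compression ratios
            seg_ratio = sum(ratios[i] * len(partitions[i]) for i in group) / len(seg)
            segments.append((seg, seg_ratio))
        return segments

    def generate_blocks(self, segments):             # data block generating module 63
        blocks = []
        for seg, ratio in segments:
            blocks.extend(split_segment(seg, expected_chunk_length(len(seg), ratio)))
        return blocks
```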
  • the present invention can be implemented by means of software plus necessary general hardware, and of course, by hardware, but in many cases the former is a better implementation.
  • Based on this understanding, the part of the technical solution of the present invention that is essential or that contributes to the prior art may be embodied in the form of a software product stored in a readable storage medium, such as a computer floppy disk, hard disk or optical disc, which includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

Abstract

The present invention provides a data object processing method and apparatus, which can divide a data object into one or more partitions; calculate a sampling compression ratio of each partition, aggregate adjacent consecutive partitions whose sampling compression ratios share a common feature into one data segment, and obtain the sampling compression ratio of each data segment; and, according to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the sampling compression ratio of each data segment belongs, select an expected length to split the data segment into data blocks, where the sampling compression ratio of each data segment belongs to exactly one compression-ratio interval and the length of each data segment belongs to exactly one length interval. By applying the technique provided by the present invention, a data object can be split into data blocks.

Description

Data object processing method and apparatus

Technical Field

The present invention relates to the field of information technology, and in particular to a data object processing method and apparatus.

Background Art
Data deduplication is the process of discovering and eliminating duplicate content in a data set or data stream in order to improve the storage and/or transmission efficiency of the data; it is also called duplicate data elimination, or deduplication for short. Deduplication techniques usually split a data set or data stream into a series of data units and keep only one copy of each repeated data unit, thereby reducing the space overhead of data storage or the bandwidth consumed during transmission.

How to divide a data object into data units in which duplicate content is easy to discover is a key problem to be solved. After a data object has been divided into data units, a hash value h( ) of each data block can be calculated as its fingerprint, and data units with the same fingerprint are defined as duplicate data. Deduplication units commonly used in the prior art include files, fixed-length blocks, and content-based variable-length chunks. Among them, Content Defined Chunking (CDC) uses a sliding window to scan the data, identifies byte strings matching a preset feature, and marks the positions of those byte strings as chunk boundaries, thereby splitting the data set or data stream into a sequence of variable-length chunks. Because this method selects chunk boundaries according to the content characteristics of the data, it can more sensitively find the data units shared by similar files or data streams, and is therefore widely used in various deduplication schemes. Research shows that when a content-defined chunking method is used to split a data set or data stream, the smaller the chunk granularity, the higher the probability of finding duplicate data and the better the deduplication effect; however, a smaller chunk granularity also means that more chunks are obtained from a given data set, which increases the index overhead and the complexity of looking up duplicate data and thus lowers the time efficiency of deduplication.

The expected length is the key parameter by which the Content Defined Chunking (CDC) method controls chunk granularity. In general, the CDC method outputs a variable-length chunk sequence for a given data object; the length of each chunk statistically obeys a normal distribution, and the expected length is used to adjust the mean of that distribution. The mean of the normal distribution appears as the average chunk length; because a normally distributed random variable takes its mean with the highest probability, the average chunk length is also called the peak length, and in the ideal case it equals the expected length. For example, in the CDC method the fingerprint f(w-bytes) of the data in the sliding window is computed in real time, and when certain bits of f(w-bytes) match a preset value, the current window position is selected as a chunk boundary. Because updates to the data content cause random changes of the hash fingerprint, if the matching condition is set to f(w-bytes) & 0xFFF = 0, where & is the bitwise AND over the binary field and 0xFFF is the hexadecimal representation of 4095, then in theory one fingerprint match occurs for every 4096 random changes of f(w-bytes); that is, a chunk boundary is found roughly every 4 KB (4096 bytes) the window slides forward. This chunk length under ideal conditions is the expected chunk length of the CDC method, or expected length for short.

To reduce the number of chunks while maintaining the space efficiency of deduplication, the prior art proposes a content-based bimodal chunking method. Its core idea is to use two variable-length chunking modes with different expected lengths: when a file is split into data blocks, the deduplication storage system is queried to determine whether candidate chunks are duplicates, a small-chunk mode is used in the transition regions between duplicate and non-duplicate data, and a large-chunk mode is used outside the transition regions.

However, this technique cannot work independently. When deciding how to chunk a data object, the chunking device has to frequently query the fingerprints of the data blocks already present in the deduplication storage device, which stores the data blocks remaining after deduplication; based on the repetitiveness of the candidate chunks, it judges whether a transition region between duplicate and non-duplicate data has been reached and then decides which chunking mode is finally used. This prior art therefore places a query load on the deduplication storage device.

Summary of the Invention
Embodiments of the present invention provide a data object processing technique that can split a data object into data blocks.

According to a first aspect, an embodiment of the present invention provides a data object processing method, including: dividing a data object into one or more partitions; calculating a sampling compression ratio of each partition, aggregating consecutive partitions whose sampling compression ratios share a common feature into one data segment, and obtaining the sampling compression ratio of each data segment; and, according to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the sampling compression ratio of each data segment belongs, selecting an expected length to split the data segment into data blocks, where the sampling compression ratio of each data segment belongs to exactly one compression-ratio interval and the length of each data segment belongs to exactly one length interval.

According to a second aspect, an embodiment of the present invention provides a data object processing apparatus, including: a partition dividing module, configured to divide a data object into one or more partitions; a data segment generating module, configured to calculate a sampling compression ratio of each partition, aggregate consecutive partitions whose sampling compression ratios share a common feature into one data segment, and obtain the sampling compression ratio of each data segment; and a data block generating module, configured to select, according to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the sampling compression ratio of each data segment belongs, an expected length to split the data segment into data blocks, where the sampling compression ratio of each data segment belongs to exactly one compression-ratio interval and the length of each data segment belongs to exactly one length interval.

With the embodiments of the present invention, the data object is first divided into partitions, the partitions are then aggregated into data segments according to the compression ratio of each partition, and the segments are then split into data blocks, thereby achieving the effect of splitting the data object into data blocks.

Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below; the drawings described below are merely some embodiments of the present invention, and other drawings can be obtained from them.
FIG. 1 is a flowchart of an embodiment of a data object processing method;

FIG. 2 is a flowchart of an embodiment of a data object processing method;

FIG. 3 is a flowchart of an embodiment of a data object processing method;

FIG. 4 is a flowchart of an embodiment of a data object processing method;

FIG. 5 is a flowchart of an embodiment of a data object processing method;

FIG. 6 is a structural diagram of an embodiment of a data object processing apparatus.

Detailed Description
The technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings of the embodiments; the described embodiments are merely some rather than all of the embodiments of the present invention, and all other embodiments obtained based on the embodiments of the present invention fall within the protection scope of the present invention. Lowering the expected length of the data blocks helps to obtain a better deduplication ratio, but it also increases the number of data blocks and the corresponding amount of index data, which raises the complexity of looking up duplicate blocks and constrains deduplication performance.

The prior art proposes using fine-grained chunking in the transition regions between duplicate and non-duplicate content and coarse-grained chunking elsewhere, thereby forming a bimodal chunking method. However, this method has to query the repetitiveness of candidate chunks frequently during chunking, which places a query load on the deduplication storage system; in addition, its chunking result depends on the transmission order of the multi-version data and is not stable.

The embodiments of the present invention provide a multi-peak deduplication chunking method based on content and compressibility, which can split a data object into data blocks. Specifically, the data object is first divided into partitions, the partitions are aggregated into data segments according to their sampling compression ratios, and an expected length is then selected according to the length and the sampling compression ratio of each data segment to split each data segment into data blocks, thereby splitting the data object into a sequence of data blocks with a multi-peak length distribution.

In the embodiments of the present invention, a data object is a piece of data that can be operated on, such as a file or a data stream. A chunking policy mapping table maintains the mapping among compression-ratio intervals, length intervals and expected chunk lengths: the higher the compression ratio of a data segment and the longer the segment, the larger the corresponding expected chunk length.

The method of the embodiments of the present invention can split a data object into data blocks, and the resulting data blocks can serve as the units of data deduplication, so that the data object can be stored or transmitted with less storage space and without data loss; the data blocks may of course also be used for purposes other than deduplication. The embodiments of the present invention include: step (a), a data object (Data Object) is input to the chunking device; the data object, for example a file or a data stream, may come from outside the chunking device or from memory inside it, as long as it meets the requirements of the deduplication operation, which is not limited in this embodiment. Step (b): the data object is divided into one or more partitions (Block), the compression ratio of each partition is estimated by sampling, the chunking policy mapping table is queried, and adjacent consecutive partitions belonging to the same compression-ratio interval are aggregated into one data segment (Segment); the sampling compression ratio is a measure of the compressibility of the data, and when the whole partition is sampled it equals the compression ratio of the partition. Step (c): for each data segment, the chunking policy mapping table is queried, an expected length is selected according to the segment's compression-ratio interval and length interval, and a content-defined chunking method divides the data segment into a sequence of chunks (Chunk) according to the selected expected length. Step (d): adjacent chunks located in different data segments are spliced, and the hash value of each chunk is calculated for deduplication; the splicing step is optional, and the hash values may also be calculated directly without splicing. Step (b) uses the sampling compression ratio information of the partition sequence to re-divide the data object into a sequence of data segments, and this can be done in several ways; for example, in another implementation, step (b) may not refer to compression-ratio intervals at all, but instead aggregate adjacent consecutive partitions whose sampling compression ratios differ by less than a specified threshold into one data segment.
Embodiment 1

Referring to FIG. 1, a data object processing method according to an embodiment of the present invention is described below in more detail in concrete steps.
Step 11: Divide the data object to be deduplicated into one or more partitions, calculate the sampling compression ratio of each partition, aggregate adjacent consecutive partitions having a common compression-ratio feature into one data segment, and obtain the sampling compression ratio of each data segment. Each partition may be of fixed length or of variable length; in the variable-length case, a random length within a certain range may be chosen as the partition length, or the data object may be scanned to output multiple sets of candidate chunk boundaries with different expected lengths and one set of candidate chunk boundaries may be used to divide the data object into one or more variable-length partitions.

The compression ratio measures how much the data can be compressed and is computed as: compression ratio = compressed data size / original data size. One way to estimate it is: for each partition, extract a piece of sample data according to a sampling ratio S, compress the sample with a data compression algorithm (for example a lossless LZ algorithm or RLE coding), calculate the compression ratio, and use the sample's compression ratio as the compression ratio of the sampled partition. The higher the sampling ratio, the closer the sample's compression ratio is to the true compression ratio of the partition.

For a compressible data segment the compression ratio is generally less than 1; for an incompressible data segment, because metadata overhead such as description fields is added, the length of the compressed encoding may exceed the original data length, so the compression ratio can be greater than 1. The compression-ratio table entries contain a series of compression-ratio intervals; the intersection of any two different intervals is empty, and the union of all intervals forms the complete compression-ratio range [0, ∞).

A compression-ratio feature means that the compression ratio is used as the parameter for aggregating partitions: partitions whose compression ratio, or a value computed from it, satisfies a preset condition are the partitions that conform to the compression-ratio feature. Specifically, the compression-ratio feature may be a compression-ratio interval, that is, a range of compression ratios, or a threshold on the difference between the compression ratios of adjacent partitions. In the example of Table 1, a compression-ratio interval is used as the compression-ratio feature. The data object is divided into partitions 1 through 7, seven in total. Each partition is sampled and its compression ratio estimated: for example, the sampling compression ratio of partition 1 is 0.4 and that of partition 2 is 0.42. Each sampling compression ratio belongs to one compression-ratio interval, for example 0.4 belongs to [0, 0.5) and 0.61 belongs to [0.5, 0.8); since each partition has one sampling compression ratio, each partition can be regarded as belonging to one compression-ratio interval. Partitions belonging to the same compression-ratio interval are aggregated into one data segment, so partition 1 and partition 2 can be aggregated into data segment 1, partitions 3, 4 and 5 into data segment 2, and partitions 6 and 7 into data segment 3.
[Table 1 is reproduced as an image (imgf000008_0001) in the original publication; it lists the sampling compression ratios of partitions 1-7 and the compression-ratio intervals and data segments to which they belong.]
Step 12: According to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the sampling compression ratio of each data segment belongs, select an expected length to split the data segment into data blocks. The range of possible data segment lengths is divided into at least one length interval; the length of each data segment belongs to exactly one length interval, and the sampling compression ratio of each data segment belongs to exactly one compression-ratio interval.

Different compression-ratio intervals do not intersect, different length intervals do not intersect, and each combination of the two corresponds to one expected length; the sampling compression ratio of each data segment belongs to exactly one compression-ratio interval and its length to exactly one length interval. The sampling compression ratio of a data segment can be obtained by sampling and compressing the segment itself, or by averaging the sampling compression ratios of the partitions that make up the segment; the average may be an arithmetic mean, or a weighted average computed according to the partition lengths, in which the sampling compression ratio of each constituent partition is taken as the value and the length of each partition as its weight.
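Expressed as a formula (a restatement of the weighted average just described, with the segment composed of partitions $1,\dots,n$ of lengths $\ell_i$ and sampling compression ratios $r_i$):

$$ r_{\mathrm{seg}} \;=\; \frac{\sum_{i=1}^{n} \ell_i \, r_i}{\sum_{i=1}^{n} \ell_i} $$

For instance, two equal-length partitions with the Table 1 ratios 0.40 and 0.42 (the equal lengths are an assumption) give $r_{\mathrm{seg}} = 0.41$; in that case the weighted and arithmetic means coincide.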
When a data segment is split into data blocks, the block boundaries must be found; the data between two adjacent boundaries is one data block.

If the partitioning method in step 11 was to compute multiple sets of candidate chunk boundaries with different expected lengths and use one set to divide the data object into variable-length partitions, then this step may, according to the selected expected length, pick from those candidate boundaries the chunk boundaries with the same expected length to split the data segment into data blocks; that is, the corresponding boundaries can be selected from the sets of candidates obtained by the scan in step 11 without scanning the data segment again. If another partitioning method was used in step 11, for example fixed-length partitioning, this step may select an expected length and then scan the data segment to find the chunk boundaries, which adds one pass of scanning the data segments compared with the method of pre-computing candidate chunk boundaries.

Referring to the example of Table 2, the range of compression ratios is divided into three intervals [0, 0.5), [0.5, 0.8) and [0.8, ∞); the intervals do not intersect, and the sampling compression ratio of each data segment belongs to one of them, which can also be understood as each data segment corresponding to one compression-ratio interval. In addition, the range of data segment lengths is divided into at least one length interval, and the length intervals do not intersect, so each data segment corresponds to one length interval; the length interval and the compression-ratio interval together determine the expected length of the data segment. For example, the length of data segment A belongs to the interval [0 MB, 10 MB) and its sampling compression ratio lies in [0, 0.5); the expected length jointly determined by the length interval [0 MB, 10 MB) and the compression-ratio interval [0, 0.5) is 32 KB, so the expected length of data segment A is 32 KB. In the same way, the length interval of data segment B and the compression-ratio range of its sampling compression ratio give data segment B an expected length of 256 KB. Once the expected lengths have been obtained, each data segment can be split into data blocks according to its expected length: data segment A is split with an expected length of 32 KB and data segment B with an expected length of 256 KB (B is the abbreviation of byte, 1 KB = 1024 bytes, 1 MB = 1024 KB). Because the expected length is a theoretical value, the actual lengths of the data blocks may differ, that is, they are variable-length; however, among the data blocks formed by splitting data segment A, the average block length is close to the expected length of 32 KB, a block length of 32 KB occurs with the highest probability, and 32 KB is also a peak length. In the embodiments of the present invention each expected length corresponds to one peak length, so the embodiments constitute a data chunking and deduplication method with multiple peak lengths.
[Table 2 is reproduced as an image (imgf000010_0001) in the original publication; it maps each combination of length interval and compression-ratio interval to an expected chunk length.]
In this embodiment, the expected length may be obtained from empirical values or by analysis and statistics. An optional rule for determining the expected length, illustrated by the example of Table 2, is: for the same compression-ratio interval, the corresponding expected length increases as the lower bound of the length interval increases; conversely, for the same length interval, the expected length recorded in the mapping table increases as the lower bound of the compression-ratio interval increases. The expected chunk length of a data segment is thus positively correlated with the lower bound of the compression-ratio interval to which its sampling compression ratio belongs and with the lower bound of the length interval to which its length belongs. In some cases the deduplication ratio of large data objects and of hard-to-compress data objects is not sensitive to chunk granularity; with this way of choosing the expected length, the number of chunks can be reduced quickly without causing the deduplication ratio to deteriorate rapidly.

According to the position of each data block in the data segment, the data blocks formed by splitting one data segment have a sequential order; these ordered blocks form a data block sequence, which may also be called a chunk subsequence. Step 12 may further include splicing the chunk subsequences generated from the different data segments to form the chunk sequence of the data object, in which the order of the chunks is the same as the order of the chunked data within the data object. Step 13: in each pair of adjacent data segments, splice the last data block of the preceding data segment and the first data block of the following data segment into one data block; the block formed in this way may be called a spliced data block. Step 13 is optional: for example, if the data segment boundaries in step 11 come from the boundaries of fixed-length partitions, step 13 may be used, and after splicing the fixed-length partition boundaries are prevented from being inherited as chunk boundaries, which yields a better deduplication effect.
In addition, the neighbouring data blocks on the two sides of a spliced data block produced in the transition region between two data segments correspond to different expected lengths. Optionally, the smaller of the two expected lengths may be used to split the spliced data block into finer-grained data blocks, which makes it easier to find duplicate content in the transition region between the two segments and improves the deduplication effect while adding only a small number of extra blocks. In other embodiments, an expected length even smaller than the smaller of the two may also be used to split the spliced data block into finer-grained blocks.
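As an illustration of this optional re-splitting, the sketch below joins the boundary blocks of two adjacent segments and re-chunks the spliced block at the finer of the two expected lengths; it reuses the split_segment function from the earlier content-defined-chunking sketch, which is an assumption (any chunker with a selectable expected length would do).

```python
def splice_and_resplit(prev_chunks, next_chunks, prev_expected, next_expected):
    """Optional step 13: merge the boundary blocks of two adjacent data segments and
    re-chunk the spliced block at the finer of the two expected lengths, so that the
    fixed-length partition boundary is not inherited as a chunk boundary."""
    spliced = prev_chunks[-1] + next_chunks[0]
    finer = min(prev_expected, next_expected)       # or any expected length <= this value
    middle = split_segment(spliced, finer)          # reuses the chunking sketch above
    return prev_chunks[:-1] + middle + next_chunks[1:]
```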
When step 12 includes the step of splicing the chunk subsequences, step 13 may be performed either before or after that splicing.

Step 14: Calculate the fingerprint of each data block, use the fingerprint to determine whether the data block is already stored in the storage device, and send the data blocks not yet stored, together with the fingerprints and sampling compression ratios corresponding to those unstored blocks, to the storage device.

The fingerprint uniquely identifies a data block, and there is a one-to-one correspondence between a data block and its fingerprint; one fingerprinting method is to compute a hash value of the data block as its fingerprint.

Step 15: The storage device stores the received data and the fingerprints of the data blocks. When storing, it may check whether the sampling compression ratio of a received data block meets a compression-ratio threshold, compress and store the blocks that meet the threshold to save storage space, and store the blocks that do not meet the threshold directly without compression. For example, if the threshold is "at most 0.7", data blocks with a sampling compression ratio of 0.7 or less may be compressed before storage, while data with a sampling compression ratio greater than 0.7 is stored directly without compression.

It should be noted that there is more than one segmentation method for step 11. For example, in another implementation the segmentation strategy of step 11 can be modified to: aggregate adjacent consecutive partitions whose sampling compression ratios differ by less than a specified threshold into the same data segment; that is, a sampling-compression-ratio difference smaller than the specified threshold is the compression-ratio feature. Taking Table 1 as an example again, suppose the threshold is a difference of less than 0.1. Then 0.42 - 0.4 < 0.1, so partition 1 and partition 2 are placed in the same data segment, while 0.53 - 0.42 > 0.1, so partition 3 is not placed in the same data segment as partition 2. Continuing in this way, partitions 1 and 2 form the first data segment, partitions 3 and 4 the second, and partitions 5, 6 and 7 the third.
Embodiment 2

Embodiment 2 is described in detail with reference to FIG. 2; the data object processing method of this embodiment includes the following steps.
21. Load the chunking policy mapping table. Referring to Table 1 and Table 2, the table records which compression ratios belong to the same compression-ratio feature; it also records the length interval to which a data segment belongs, the compression-ratio interval to which the segment's compression ratio belongs, and the expected length jointly determined by the two, which is used to split the data segment into data blocks. This step may also be performed later, as long as the table is loaded before it needs to be used.

22. Input the data object to be processed to the apparatus performing this method, that is, obtain the data object to be processed; the source may be the apparatus performing the method or an external apparatus connected to it.

23. Partition the data object with a fixed length. In a specific implementation, this step scans the data object to obtain the boundaries of the fixed-length partitions; the data between two adjacent boundaries is one partition.

24. Sample-compress the data of each partition, and use the compression ratio obtained from the sample as the sampling compression ratio of the whole partition.

25. According to the chunking policy mapping table, aggregate consecutive partitions having the same compression-ratio feature into data segments.

26. According to the chunking policy mapping table, divide each data segment into data blocks having a specific expected length. In a specific implementation, this step scans each data segment to find the boundaries of the data blocks and splits the segment at those boundaries, forming a series of data blocks.

In this embodiment, once step 26 has been performed the processing of the data object is complete and the effect of splitting the data object into data blocks has been achieved; the subsequent steps 27, 28 and 29 are further extensions of the method.

27. Splice the chunk subsequences produced from the data segments. The data blocks formed by splitting each data segment are ordered according to their positions in the segment; these ordered blocks may also be called a chunk subsequence. One splicing method is: for any two adjacent data segments, splice the last data block of the preceding segment and the first data block of the following segment into one data block, thereby eliminating the boundary between the segments, and merge the subsequences into the data block sequence of the entire data object; optionally, the block spliced in this step may then be split into finer-grained blocks to improve the deduplication effect. Another splicing method is that every data block remains unchanged and the chunk subsequences are ordered according to the order of the data segments within the data object, with the first block of the next segment following the last block of the previous segment, forming the data block sequence of the entire object.

28. Calculate the hash value of each chunk as its fingerprint.

29. Output the chunks and their fingerprints, and optionally also the sampling compression ratio of each chunk. The sampling compression ratio of a chunk can be derived from the sampling compression ratio of the segment it belongs to; one way is to use the segment's sampling compression ratio directly as the sampling compression ratio of each chunk it contains.

When there is more than one data object, steps (22) to (29) are repeated until all data objects have been processed.
One specific implementation of the above steps is: scan the data object with a w-byte sliding window, and each time the window slides forward by one byte, recompute the fingerprint f(w-bytes) of the data in the window with a fast hash algorithm; if the expected length of the current data segment is E, whether the expression Match(f(w-bytes), E) = D holds is used to judge whether the current fingerprint satisfies the chunk-boundary filter condition, where the integer D ∈ [0, E) is a predefined feature value. The function Match( ) maps f(w-bytes) into the interval [0, E); because the hash function f( ) is random, this filter condition outputs a series of chunks with peak length E. When the flow reaches a new data segment, the expected length E changes, so a chunk subsequence with a new peak length is output. When chunking the region adjoining two data segments, the leading boundary of a data block is determined by the expected length of the preceding segment and its trailing boundary by the expected length of the following segment, so the two boundaries of that block lie in the two segments respectively; this is equivalent to automatically completing the splicing of the chunk subsequences of adjacent segments, and the flow therefore does not introduce segment boundaries as chunk boundaries.
For ease of understanding, Embodiment 2 additionally provides flowchart 3, whose steps are, in order: 3a, input the data object on which the data object processing method is to be performed; the data object may be temporarily held in a cache before being processed; 3b, partition the data object, sample it and estimate the compression ratios, where L, M and H denote three different compression-ratio intervals; 3c, aggregate consecutive partitions having a common compression-ratio feature into data segments; 3d, select an expected length according to the compression ratio and length characteristics of each data segment and compute the chunk boundaries, splitting each data segment into data blocks at the computed boundaries, the blocks of each segment forming a chunk subsequence; 3e, splice the chunk subsequences of the data segments and compute the chunk fingerprints.
Embodiment 3

Embodiment 2 divides the data object to be deduplicated into a number of fixed-length partitions. In Embodiment 3, by contrast, the data object to be deduplicated is pre-chunked using expected lengths: a content-defined chunking method scans the data object (file/data stream) and generates multiple sets of candidate chunk boundaries with different expected lengths, where each set of candidate boundaries corresponds to one candidate chunk sequence of the data object (a candidate chunk sequence can also be understood as a chunking scheme). One of these pre-chunking schemes is used to divide the data object into partitions for sampling compression and for determining how the data segments are formed; in the later step of splitting the data segments into data blocks, the candidate chunk boundaries corresponding to the selected expected length can be used to split out the data blocks.

Embodiment 2 therefore has to scan the data object both when dividing it into partitions and when splitting the data segments into data blocks, whereas this embodiment only needs to scan the data object once, which saves system resources and improves processing efficiency. In addition, because both the partition boundaries and the data block boundaries in Embodiment 3 are taken from the candidate chunk boundaries, the segment boundaries produced by aggregating partitions do not adversely affect deduplication, and no operation for removing segment boundaries from the data blocks is needed; that is, it is not necessary to splice the last data block of the preceding segment and the first data block of the following segment in adjacent data segments into one data block.

The smaller the expected length, the finer the average chunk granularity and the easier it is to perceive local changes in the compressibility of the data content; the larger the expected length, the coarser the average chunk granularity. As shown in FIG. 4, the data object processing method of this embodiment may specifically include the following steps.
41. Load the chunking policy mapping table. Referring to Table 1 and Table 2, the table records which compression ratios belong to the same compression-ratio feature; it also records the length interval to which a data segment belongs and the compression-ratio interval to which the segment's compression ratio belongs, which together determine the expected length used to split the data segment into data blocks. This step may also be performed later, as long as the table is loaded before it needs to be used.

42. Input the data object to be processed to the apparatus performing this method, that is, obtain the data object to be processed; the source may be the apparatus performing the method or an external apparatus connected to it.

43. In one scanning pass, output multiple sets of candidate chunk boundaries, each set corresponding to one expected length; these expected lengths include the expected length of every data segment used in the subsequent steps.

44. From the multiple sets of candidate chunk boundaries of step 43, select one set of candidate boundaries and sample-compress the candidate chunks formed by that set. This step in effect partitions the data object, the expected length of the partitions being one of the expected lengths of step 43; when partitioning according to this expected length the data object does not need to be scanned again, and the partitions are obtained directly from the candidate chunk boundaries corresponding to that expected length. Each partition's data is then sample-compressed, and the compression ratio of the sample is used as the sampling compression ratio of the whole partition.

45. According to the chunking policy mapping table, aggregate consecutive partitions having the same compression-ratio feature into data segments.

46. According to the chunking policy mapping table, select for each data segment the candidate chunk boundaries corresponding to its expected length; the boundaries of a data block determine its length and position, so this step splits the data segments into data blocks. As noted above, because the candidate chunk boundaries of step 46 come from step 43, the data object does not need to be scanned again when the data segments are split into data blocks, which saves system resources.

In this embodiment, once step 46 has been performed the processing of the data object is complete and the effect of splitting the data object into data blocks has been achieved; the subsequent steps 47, 48 and 49 are further extensions of the method. 47. Splice the chunk subsequences of adjacent data segments: the subsequences are ordered according to the order of the data segments within the data object, forming the data chunk sequence of the data object (a data chunk may also be called a data block).

48. Calculate the hash value of each data block as its fingerprint.

49. Output the chunks and their fingerprints, and optionally also the sampling compression ratio of each chunk. The sampling compression ratio of a chunk can be derived from that of the segment it belongs to; one way is to use the segment's sampling compression ratio directly as the sampling compression ratio of each chunk it contains.

When a data chunk is stored, whether to compress it before storage can be decided according to its sampling compression ratio.

When the data has not all been processed, that is, when there is more than one data object, steps 42 to 49 are performed on the remaining objects in turn until all data objects have been processed.
One specific implementation of step 43 is: the flow builds a parameter list from all the expected lengths E_i and corresponding feature values D_i in the chunking policy mapping table shown in Table 2, and for each fingerprint f(w-bytes) output by the sliding window it checks in turn, with the different parameters (E_i, D_i), whether the matching condition Match(f(w-bytes), E_i) = D_i holds; if so, the current window position is selected as a candidate chunk boundary corresponding to E_i. Fingerprint-matching efficiency can be improved by optimizing the parameters: for example, define Match(f(w-bytes), E_i) = f(w-bytes) mod E_i and choose E_0 = 2^12 B = 4 KB, E_1 = 2^15 B = 32 KB and D_0 = D_1 = 0; then whenever f(w-bytes) mod E_0 ≠ D_0 it necessarily holds that f(w-bytes) mod E_1 ≠ D_1, that is, if a fingerprint does not satisfy the filter condition for the 4 KB expected length, there is no need to check whether it satisfies the condition for the 32 KB expected length.
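A sketch of this parameter optimization, using the values quoted above (E0 = 4 KB, E1 = 32 KB, D0 = D1 = 0):

```python
# Expected lengths and feature values quoted in the text: E0 = 4 KB, E1 = 32 KB, D0 = D1 = 0.
E = [1 << 12, 1 << 15]     # E1 is a multiple of E0, which is what the optimisation relies on
D = [0, 0]

def candidate_boundary_flags(f: int):
    """For one window fingerprint f, report which expected lengths it matches.

    Because E1 is a multiple of E0 and both feature values are 0, f % E[0] != 0
    already implies f % E[1] != 0, so the cheaper test short-circuits the other."""
    if f % E[0] != D[0]:
        return [False, False]          # cannot satisfy the 32 KB condition either
    return [True, f % E[1] == D[1]]
```

Checking the cheaper 4 KB condition first therefore avoids the 32 KB check for the vast majority of window positions, which is the property the text exploits.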
For ease of understanding, Embodiment 3 additionally provides flowchart 5. The steps in FIG. 5 are, in order: 5a, input the data object on which the data object processing method is to be performed; the data object may be temporarily held in a cache before being processed; 5b, determine multiple sets of candidate chunk boundaries with different expected lengths; 5c, among the sets of candidate chunk boundaries determined in 5b, select one candidate chunk sequence to divide the data object, and sample the resulting chunks and estimate their compression ratios; 5d, aggregate consecutive candidate chunks having a common compression-ratio feature into data segments; 5e, select an expected length and the corresponding chunk boundaries according to the compression ratio and length characteristics of each data segment, the data blocks into which each segment is split at those boundaries forming a chunk subsequence; 5f, splice the chunk subsequences of the data segments and compute the chunk fingerprints.
Embodiment 4

Referring to FIG. 6, this embodiment describes a data object processing apparatus 6 to which the methods of Embodiments 1, 2 and 3 can be applied. The data object processing apparatus 6 includes a partition dividing module 61, a data segment generating module 62 and a data block generating module 63.

In the data object processing apparatus 6, the partition dividing module 61 is configured to divide a data object into one or more partitions; the data segment generating module 62 is configured to calculate the sampling compression ratio of each partition, aggregate consecutive partitions whose sampling compression ratios share a common feature into one data segment, and obtain the sampling compression ratio of each data segment; and the data block generating module 63 selects, according to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the compression ratio of each data segment belongs, an expected length with which to split the data segment into data blocks, where the compression ratio of each data segment belongs to exactly one compression-ratio interval and the length of each data segment belongs to exactly one length interval.
Specifically, the partitioning module 61 may divide the data object into one or more fixed-length partitions; it may also be configured to calculate multiple groups of candidate block boundaries with different expected lengths and to use one of the groups of candidate block boundaries to divide the data object into one or more variable-length partitions.
The data segment generating module 62 may specifically be configured to: calculate the sampled compression rate of each partition, aggregate adjacent consecutive partitions whose sampled compression rates belong to the same compression-rate interval into one data segment, and obtain the sampled compression rate of each data segment.
The data segment generating module 62 may alternatively be configured to: calculate the sampled compression rate of each partition, aggregate adjacent consecutive partitions whose sampled compression rates differ by less than a specified threshold into one data segment, and obtain the sampled compression rate of each data segment.
In the various possible implementations of the data object processing apparatus 6 described above, the data block generating module 63's function of selecting an expected length with which a data segment is split into data blocks may specifically be: according to the selected expected length, selecting, from the multiple groups of candidate block boundaries with different expected lengths calculated by the partitioning module, the block boundaries having the same expected length, and splitting the data segment into data blocks along them.
The data block generating module 63 may also be configured to splice, in two adjacent data segments, the last data block of the preceding data segment and the first data block of the following data segment into one spliced data block. Further, the data block generating module 63 may be configured to split the spliced data block into multiple data blocks, where the expected length used for this splitting is smaller than or equal to the expected length corresponding to the preceding data segment and smaller than or equal to the expected length corresponding to the following data segment.
In addition, the object processing apparatus 6 may further include a data block sending module 64. The data block sending module 64 is configured to calculate the fingerprint of each data block, determine from the fingerprints whether each data block is already stored on a storage device, and send to the storage device the data blocks not stored on the storage device together with the fingerprints and sampled compression rates corresponding to those unstored data blocks; the storage device stores the received data block fingerprints, determines whether the sampled compression rate of a received data block meets a compression-rate threshold, and compresses and then stores the data blocks that meet the compression-rate threshold.
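A sketch of sending module 64 together with the storage side, assuming SHA-1 fingerprints, a set of already-stored fingerprints, an in-memory dict standing in for the storage device, and a 0.8 compression-rate threshold (all assumptions):

```python
import hashlib
import zlib

def send_new_blocks(blocks_with_ratio, stored_fingerprints):
    """blocks_with_ratio: iterable of (block bytes, sampled compression rate).
    Returns only the blocks whose fingerprints the storage device lacks."""
    outgoing = []
    for block, ratio in blocks_with_ratio:
        fp = hashlib.sha1(block).hexdigest()
        if fp not in stored_fingerprints:
            outgoing.append((fp, block, ratio))
    return outgoing

def storage_receive(outgoing, store, threshold=0.8):
    """store: dict fingerprint -> stored bytes (stands in for the device)."""
    for fp, block, ratio in outgoing:
        payload = zlib.compress(block) if ratio <= threshold else block
        store[fp] = payload          # the fingerprint itself is always recorded
```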
The object processing apparatus 6 may also be regarded as a device composed of a CPU and a memory. The memory stores a program, and the CPU executes the method of Embodiment 1, Embodiment 2 or Embodiment 3 by means of the program in the memory. The device may further include an interface for connecting to a storage device; the function of the interface may, for example, be to send the data blocks produced by the CPU's processing to the storage device.
From the description of the foregoing implementations it can be clearly understood that the present invention may be implemented by software plus the necessary general-purpose hardware, and certainly also by hardware alone, but in many cases the former is the better implementation. Based on such an understanding, the technical solutions of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, hard disk or optical disc of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.
The foregoing specific implementations describe the objectives, technical solutions and beneficial effects of the present invention in detail, but the protection scope of the present invention is not limited thereto; any modification, equivalent replacement or improvement made by any person within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data object processing method, wherein the method comprises:
dividing a data object into one or more partitions;
calculating a sampled compression rate of each partition, aggregating consecutive partitions whose sampled compression rates have a common feature into one data segment, and obtaining the sampled compression rate of each data segment;
selecting, according to a length interval to which the length of each data segment belongs and a compression-rate interval to which each data segment's sampled compression rate belongs, an expected length with which the data segment is split into data blocks, wherein each data segment's sampled compression rate belongs to exactly one compression-rate interval, and each data segment's length belongs to exactly one length interval.
2. The method according to claim 1, wherein the dividing the data object into one or more partitions specifically comprises:
dividing the data object into one or more fixed-length partitions.
3. The method according to claim 1, wherein the method further comprises: splicing, in two adjacent data segments, the last data block of the preceding data segment and the first data block of the following data segment into one spliced data block.
4. The method according to claim 3, wherein the method further comprises: splitting the spliced data block into multiple data blocks, wherein the expected length used for the splitting is smaller than or equal to the expected length corresponding to the preceding data segment, and the expected length used for the splitting is smaller than or equal to the expected length corresponding to the following data segment.
5. The method according to claim 1, wherein the dividing the data object into one or more partitions specifically comprises:
calculating multiple groups of candidate block boundaries with different expected lengths, and using one of the groups of candidate block boundaries to divide the data object into one or more variable-length partitions.
6. The method according to claim 1 or claim 5, wherein the selecting an expected length with which the data segment is split into data blocks specifically comprises:
selecting, according to the selected expected length, from the multiple groups of candidate block boundaries, the block boundaries having the same expected length to split the data segment into data blocks.
7. The method according to claim 1, wherein the aggregating consecutive partitions whose sampled compression rates have a common feature into one data segment specifically comprises:
aggregating adjacent consecutive partitions whose sampled compression rates belong to the same compression-rate interval into one data segment.
8. The method according to claim 1, wherein the aggregating consecutive partitions whose sampled compression rates have a common feature into one data segment specifically comprises:
aggregating adjacent consecutive partitions whose sampled compression rates differ by less than a specified threshold into one data segment.
9. The method according to claim 1 or 3, wherein the method further comprises:
calculating a fingerprint of each data block, determining from the fingerprints whether each data block is already stored on a storage device, and sending, to the storage device, the data blocks not stored on the storage device together with the fingerprints and sampled compression rates corresponding to the unstored data blocks;
storing, by the storage device, the received data block fingerprints, determining whether the sampled compression rate of a received data block meets a compression-rate threshold, and compressing and then storing the data blocks that meet the compression-rate threshold.
10. The method according to claim 9, wherein
the sampled compression rate of the segment from which the data block comes is used as the sampled compression rate of the data block.
11. The method according to claim 1, wherein the sampled compression rate of the data segment is specifically:
obtained by calculating the arithmetic mean of the sampled compression rates of the partitions composing the data segment; or
a weighted average of the partitions' sampled compression rates, calculated with the sampled compression rate of each partition composing the data segment as the value and each partition's length as the weight.
12. A data object processing apparatus, wherein the apparatus comprises:
a partitioning module, configured to divide a data object into one or more partitions;
a data segment generating module, configured to calculate a sampled compression rate of each partition, aggregate consecutive partitions whose sampled compression rates have a common feature into one data segment, and obtain the sampled compression rate of each data segment;
a data block generating module, which selects, according to a length interval to which the length of each data segment belongs and a compression-rate interval to which each data segment's sampled compression rate belongs, an expected length with which the data segment is split into data blocks, wherein each data segment's sampled compression rate belongs to exactly one compression-rate interval, and each data segment's length belongs to exactly one length interval.
13. The apparatus according to claim 12, wherein the partitioning module is specifically configured to:
divide the data object into one or more fixed-length partitions.
14. The apparatus according to claim 12, wherein the data block generating module is further configured to: splice, in two adjacent data segments, the last data block of the preceding data segment and the first data block of the following data segment into one spliced data block.
15. The apparatus according to claim 14, wherein the data block generating module is further configured to: split the spliced data block into multiple data blocks, wherein the expected length used for the splitting is smaller than or equal to the expected length corresponding to the preceding data segment, and the expected length used for the splitting is smaller than or equal to the expected length corresponding to the following data segment.
16. The apparatus according to claim 12, wherein the partitioning module is further specifically configured to:
calculate multiple groups of candidate block boundaries with different expected lengths, and use one of the groups of candidate block boundaries to divide the data object into one or more variable-length partitions.
17. The apparatus according to claim 12 or 16, wherein the selecting an expected length with which the data segment is split into data blocks specifically comprises:
selecting, according to the selected expected length, from the multiple groups of candidate block boundaries, the block boundaries having the same expected length to split the data segment into data blocks.
18. The apparatus according to claim 12, wherein the data segment generating module is specifically configured to:
calculate the sampled compression rate of each partition, aggregate adjacent consecutive partitions whose sampled compression rates belong to the same compression-rate interval into one data segment, and obtain the sampled compression rate of each data segment.
19. The apparatus according to claim 12, wherein the data segment generating module is specifically configured to:
calculate the sampled compression rate of each partition, aggregate adjacent consecutive partitions whose sampled compression rates differ by less than a specified threshold into one data segment, and obtain the sampled compression rate of each data segment.
20. The apparatus according to claim 12, wherein the apparatus further comprises:
a data block sending module, configured to calculate a fingerprint of each data block, determine from the fingerprints whether each data block is already stored on a storage device, and send, to the storage device, the data blocks not stored on the storage device together with the fingerprints and sampled compression rates corresponding to the unstored data blocks;
wherein the storage device stores the received data block fingerprints, determines whether the sampled compression rate of a received data block meets a compression-rate threshold, and compresses and then stores the data blocks that meet the compression-rate threshold.
PCT/CN2013/081757 2013-08-19 2013-08-19 Data object processing method and apparatus WO2015024160A1 (zh)

Priority Applications (9)

Application Number Priority Date Filing Date Title
KR1020157021462A KR101653692B1 (ko) 2013-08-19 2013-08-19 Data object processing method and apparatus
RU2015139685A RU2626334C2 (ru) 2013-08-19 2013-08-19 Data object processing method and apparatus
CA2898667A CA2898667C (en) 2013-08-19 2013-08-19 Data object processing method and apparatus
PCT/CN2013/081757 WO2015024160A1 (zh) 2013-08-19 2013-08-19 Data object processing method and apparatus
EP13892074.9A EP2940598B1 (en) 2013-08-19 2013-08-19 Data object processing method and device
CN201380003213.3A CN105051724B (zh) 2013-08-19 2013-08-19 Data object processing method and apparatus
JP2015561907A JP6110517B2 (ja) 2013-08-19 2013-08-19 Data object processing method and apparatus
BR112015023973-0A BR112015023973B1 (pt) Data object processing method and apparatus
US14/801,421 US10359939B2 (en) 2013-08-19 2015-07-16 Data object processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2013/081757 WO2015024160A1 (zh) 2013-08-19 2013-08-19 Data object processing method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/801,421 Continuation US10359939B2 (en) 2013-08-19 2015-07-16 Data object processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2015024160A1 true WO2015024160A1 (zh) 2015-02-26

Family

ID=52482915

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/081757 WO2015024160A1 (zh) 2013-08-19 2013-08-19 Data object processing method and apparatus

Country Status (8)

Country Link
US (1) US10359939B2 (zh)
EP (1) EP2940598B1 (zh)
JP (1) JP6110517B2 (zh)
KR (1) KR101653692B1 (zh)
CN (1) CN105051724B (zh)
CA (1) CA2898667C (zh)
RU (1) RU2626334C2 (zh)
WO (1) WO2015024160A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112306974A (zh) * 2019-07-30 2021-02-02 深信服科技股份有限公司 Data processing method, apparatus, device and storage medium

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430383B1 (en) * 2015-09-30 2019-10-01 EMC IP Holding Company LLC Efficiently estimating data compression ratio of ad-hoc set of files in protection storage filesystem with stream segmentation and data deduplication
CN106445412A (zh) * 2016-09-14 2017-02-22 郑州云海信息技术有限公司 Method and system for evaluating the compression rate of a data volume
CN107014635B (zh) * 2017-04-10 2019-09-27 武汉轻工大学 Grain balanced sampling method and apparatus
CN107203496B (zh) * 2017-06-01 2020-05-19 武汉轻工大学 Grain distribution sampling method and apparatus
US11010233B1 (en) 2018-01-18 2021-05-18 Pure Storage, Inc Hardware-based system monitoring
CN110413212B (zh) * 2018-04-28 2023-09-19 伊姆西Ip控股有限责任公司 Method, device and computer program product for identifying reducible content in data to be written
US11232075B2 (en) * 2018-10-25 2022-01-25 EMC IP Holding Company LLC Selection of hash key sizes for data deduplication
CN111722787B (zh) * 2019-03-22 2021-12-03 华为技术有限公司 Chunking method and apparatus
CN111831297B (zh) * 2019-04-17 2021-10-26 中兴通讯股份有限公司 Zero-difference upgrade method and apparatus
CN110113614B (zh) * 2019-05-13 2022-04-12 格兰菲智能科技有限公司 Image processing method and image processing apparatus
US11157189B2 (en) * 2019-07-10 2021-10-26 Dell Products L.P. Hybrid data reduction
US11615185B2 (en) 2019-11-22 2023-03-28 Pure Storage, Inc. Multi-layer security threat detection for a storage system
US11625481B2 (en) 2019-11-22 2023-04-11 Pure Storage, Inc. Selective throttling of operations potentially related to a security threat to a storage system
US11941116B2 (en) 2019-11-22 2024-03-26 Pure Storage, Inc. Ransomware-based data protection parameter modification
US11341236B2 (en) 2019-11-22 2022-05-24 Pure Storage, Inc. Traffic-based detection of a security threat to a storage system
US20210382992A1 (en) * 2019-11-22 2021-12-09 Pure Storage, Inc. Remote Analysis of Potentially Corrupt Data Written to a Storage System
US11687418B2 (en) 2019-11-22 2023-06-27 Pure Storage, Inc. Automatic generation of recovery plans specific to individual storage elements
US11675898B2 (en) 2019-11-22 2023-06-13 Pure Storage, Inc. Recovery dataset management for security threat monitoring
US11651075B2 (en) 2019-11-22 2023-05-16 Pure Storage, Inc. Extensible attack monitoring by a storage system
US11657155B2 (en) 2019-11-22 2023-05-23 Pure Storage, Inc Snapshot delta metric based determination of a possible ransomware attack against data maintained by a storage system
US11645162B2 (en) 2019-11-22 2023-05-09 Pure Storage, Inc. Recovery point determination for data restoration in a storage system
US11500788B2 (en) 2019-11-22 2022-11-15 Pure Storage, Inc. Logical address based authorization of operations with respect to a storage system
US11755751B2 (en) 2019-11-22 2023-09-12 Pure Storage, Inc. Modify access restrictions in response to a possible attack against data stored by a storage system
US11720692B2 (en) 2019-11-22 2023-08-08 Pure Storage, Inc. Hardware token based management of recovery datasets for a storage system
US11520907B1 (en) * 2019-11-22 2022-12-06 Pure Storage, Inc. Storage system snapshot retention based on encrypted data
US11720714B2 (en) 2019-11-22 2023-08-08 Pure Storage, Inc. Inter-I/O relationship based detection of a security threat to a storage system
TWI730600B (zh) * 2020-01-21 2021-06-11 群聯電子股份有限公司 Data writing method, memory control circuit unit and memory storage device
KR102337673B1 (ko) * 2020-07-16 2021-12-09 (주)휴먼스케이프 Data viewing verification system and method
JP2022030385A (ja) * 2020-08-07 2022-02-18 富士通株式会社 Information processing apparatus and duplication rate estimation program
CN112102144B (zh) * 2020-09-03 2023-08-22 海宁奕斯伟集成电路设计有限公司 Method, apparatus and electronic device for arranging compressed data
CN112560344B (zh) * 2020-12-14 2023-12-08 北京云歌科技有限责任公司 Method and apparatus for constructing a model serving system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080133561A1 (en) * 2006-12-01 2008-06-05 Nec Laboratories America, Inc. Methods and systems for quick and efficient data management and/or processing
CN101706825A (zh) * 2009-12-10 2010-05-12 华中科技大学 Data deduplication method based on file content type

Family Cites Families (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3069455B2 (ja) * 1992-12-22 2000-07-24 富士写真フイルム株式会社 Quantization/inverse quantization circuit in an image data compression/expansion apparatus
US5870036A (en) 1995-02-24 1999-02-09 International Business Machines Corporation Adaptive multiple dictionary data compression
FR2756399B1 (fr) 1996-11-28 1999-06-25 Thomson Multimedia Sa Method and device for video compression of synthetic images
US6243081B1 (en) * 1998-07-31 2001-06-05 Hewlett-Packard Company Data structure for efficient retrieval of compressed texture data from a memory system
US7519274B2 (en) 2003-12-08 2009-04-14 Divx, Inc. File format for multiple track digital data
US8092303B2 (en) 2004-02-25 2012-01-10 Cfph, Llc System and method for convenience gaming
US7269689B2 (en) 2004-06-17 2007-09-11 Hewlett-Packard Development Company, L.P. System and method for sharing storage resources between multiple files
US20070058874A1 (en) 2005-09-14 2007-03-15 Kabushiki Kaisha Toshiba Image data compressing apparatus and method
KR100792247B1 (ko) * 2006-02-28 2008-01-07 주식회사 팬택앤큐리텔 Image data processing system and method
US9465823B2 (en) 2006-10-19 2016-10-11 Oracle International Corporation System and method for data de-duplication
US7814284B1 (en) * 2007-01-18 2010-10-12 Cisco Technology, Inc. Redundancy elimination by aggregation of multiple chunks
US7519635B1 (en) * 2008-03-31 2009-04-14 International Business Machines Corporation Method of and system for adaptive selection of a deduplication chunking technique
US8645333B2 (en) 2008-05-29 2014-02-04 International Business Machines Corporation Method and apparatus to minimize metadata in de-duplication
US8108353B2 (en) 2008-06-11 2012-01-31 International Business Machines Corporation Method and apparatus for block size optimization in de-duplication
AU2009335697A1 (en) 2008-12-18 2011-08-04 Copiun, Inc. Methods and apparatus for content-aware data partitioning and data de-duplication
US8140491B2 (en) 2009-03-26 2012-03-20 International Business Machines Corporation Storage management through adaptive deduplication
US8407193B2 (en) 2010-01-27 2013-03-26 International Business Machines Corporation Data deduplication for streaming sequential data storage applications
WO2011129818A1 (en) 2010-04-13 2011-10-20 Empire Technology Development Llc Adaptive compression
CN102143039B (zh) 2010-06-29 2013-11-06 华为技术有限公司 Data segmentation method and device in data compression
CA2809224C (en) 2010-08-31 2016-05-17 Nec Corporation Storage system
EP2612443A1 (en) 2010-09-03 2013-07-10 Loglogic, Inc. Random access data compression
RU2467499C2 (ru) * 2010-09-06 2012-11-20 Государственное образовательное учреждение высшего профессионального образования "Поволжский государственный университет телекоммуникаций и информатики" (ГОУВПО ПГУТИ) Method for compressing a digital video signal stream in a television communication channel
JP2012164130A (ja) * 2011-02-07 2012-08-30 Hitachi Solutions Ltd Data division program
EP2652587B1 (en) 2011-06-07 2017-11-15 Hitachi, Ltd. Storage system comprising flash memory, and storage control method
US9026752B1 (en) * 2011-12-22 2015-05-05 Emc Corporation Efficiently estimating compression ratio in a deduplicating file system
US8615499B2 (en) * 2012-01-27 2013-12-24 International Business Machines Corporation Estimating data reduction in storage systems
WO2014030252A1 (ja) 2012-08-24 2014-02-27 株式会社日立製作所 Storage apparatus and data management method
US9626373B2 (en) * 2012-10-01 2017-04-18 Western Digital Technologies, Inc. Optimizing data block size for deduplication
US10817178B2 (en) * 2013-10-31 2020-10-27 Hewlett Packard Enterprise Development Lp Compressing and compacting memory on a memory device wherein compressed memory pages are organized by size

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2940598A4 *

Also Published As

Publication number Publication date
BR112015023973A2 (pt) 2017-07-18
JP2016515250A (ja) 2016-05-26
KR20150104623A (ko) 2015-09-15
RU2626334C2 (ru) 2017-07-26
EP2940598A1 (en) 2015-11-04
EP2940598B1 (en) 2019-12-04
CA2898667A1 (en) 2015-02-26
CA2898667C (en) 2019-01-15
CN105051724A (zh) 2015-11-11
CN105051724B (zh) 2018-09-28
EP2940598A4 (en) 2016-06-01
RU2015139685A (ru) 2017-03-22
US20170017407A1 (en) 2017-01-19
US10359939B2 (en) 2019-07-23
KR101653692B1 (ko) 2016-09-02
JP6110517B2 (ja) 2017-04-05

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase; Ref document number: 201380003213.3; Country of ref document: CN
121 EP: the EPO has been informed by WIPO that EP was designated in this application; Ref document number: 13892074; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase; Ref document number: 2898667; Country of ref document: CA
WWE WIPO information: entry into national phase; Ref document number: 2013892074; Country of ref document: EP
ENP Entry into the national phase; Ref document number: 20157021462; Country of ref document: KR; Kind code of ref document: A
ENP Entry into the national phase; Ref document number: 2015561907; Country of ref document: JP; Kind code of ref document: A
ENP Entry into the national phase; Ref document number: 2015139685; Country of ref document: RU; Kind code of ref document: A
REG Reference to national code; Ref country code: BR; Ref legal event code: B01A; Ref document number: 112015023973; Country of ref document: BR
NENP Non-entry into the national phase; Ref country code: DE
ENP Entry into the national phase; Ref document number: 112015023973; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20150917