WO2015024160A1 - A data object processing method and apparatus - Google Patents
A data object processing method and apparatus
- Publication number
- WO2015024160A1 (PCT/CN2013/081757)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- length
- block
- data segment
- compression rate
- Prior art date
Classifications
- G06F3/0608—Saving storage space on storage systems
- G06F3/061—Improving I/O performance
- G06F3/0641—De-duplication techniques
- G06F3/068—Hybrid storage device
- G06F12/023—Free address space management
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F2212/1044—Space efficiency improvement
- H03M7/3064—Segmenting
- H03M7/3091—Data deduplication
Definitions
- The present invention relates to the field of information technology, and in particular to a data object processing method and apparatus. Background
- Data deduplication is the process of discovering and eliminating duplicate content in a data set or data stream to improve the storage and/or transmission efficiency of the data. It is also called duplicate data elimination, or simply deduplication.
- Deduplication techniques typically split a data set or data stream into a series of data units and retain only one copy of the repeated data units, thereby reducing the space overhead in the data storage process or the bandwidth consumption during transmission.
- A hash value of a data block can be calculated as its fingerprint, and data units having the same fingerprint are defined as duplicate data.
- Deduplication units commonly used in the prior art include files, fixed length blocks, content-based variable length chunks, and the like.
- Content Defined Chunking (CDC) uses a sliding window to scan the data and identify byte strings that conform to a preset feature; the location of each such byte string is marked as a block boundary, and the data set or data stream is thereby split into a sequence of variable-length blocks.
- Because this method selects block boundaries based on the content characteristics of the data, it can more sensitively find data units shared by similar files or data streams, and is therefore widely used in data deduplication schemes.
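As a rough illustration of CDC-style chunking, the sketch below uses a simple cumulative hash that is reset at every boundary (a cheap stand-in for the rolling Rabin fingerprint typically used); `expected_len`, `min_len`, and `max_len` are illustrative parameters, not values from this patent:

```python
def cdc_chunk(data: bytes, expected_len: int = 64,
              min_len: int = 16, max_len: int = 256):
    """Split data at content-defined boundaries.

    A boundary is declared where the window hash satisfies the filter
    condition h % expected_len == 0, which holds with probability
    roughly 1/expected_len, giving a peak chunk length near
    expected_len; min_len/max_len bound the chunk sizes.
    """
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) & 0xFFFFFFFF      # cheap cumulative hash
        length = i - start + 1
        if length < min_len:
            continue                        # enforce minimum chunk size
        if h % expected_len == 0 or length >= max_len:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0             # reset at each boundary
    if start < len(data):
        chunks.append(data[start:])         # trailing chunk
    return chunks
```

Because boundaries depend only on content, two similar streams resynchronize on the same boundaries after a local edit, which is what makes CDC attractive for deduplication.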
- Research shows that when a content-based chunking method splits a data set or data stream, a smaller block granularity raises the probability of finding duplicate data and thus improves the deduplication effect. However, a smaller granularity also means more blocks for a given data set, which increases the index overhead and the complexity of finding duplicate data, reducing the time efficiency of deduplication.
- In the Content Defined Chunking (CDC) method, the expected length is the key parameter that controls block granularity.
- For a given data object, the CDC method outputs a sequence of variable-length blocks whose lengths statistically obey a normal distribution; the expected length adjusts the mean of that distribution.
- The mean of the normal distribution is the average block length. Since a normally distributed random variable takes values near its mean with the highest probability, the average block length is also called the peak length; ideally it equals the expected length.
- The prior art proposes a content-based bimodal chunking method.
- Its core idea is to use two variable-length chunking modes with different expected lengths.
- The deduplication storage system is queried to determine the repeatability of candidate blocks; transition regions between duplicate and non-duplicate data use the small-block mode, while other regions use the large-block mode.
- When deciding how a data object is divided, the chunking device must therefore frequently query the fingerprints of existing data blocks in the deduplication storage device, determining from the repeatability of candidate blocks whether a transition region between duplicate and non-duplicate data has been encountered and which chunking mode to finally adopt. This prior art therefore places query load pressure on the deduplication storage device. Summary of the invention
- Embodiments of the present invention provide a data object processing technique, which can split a data object into data blocks.
- An embodiment of the present invention provides a data object processing method, including: dividing a data object into one or more partitions; calculating a sampling compression ratio of each partition, and aggregating consecutive partitions whose sampling compression ratios share a common feature into data segments, obtaining a sampling compression ratio of each data segment; and selecting, according to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the sampling compression ratio of each data segment belongs, an expected length with which to split the data segment into data blocks, wherein the sampling compression ratio of each data segment uniquely belongs to one compression-ratio interval and the length of each data segment uniquely belongs to one length interval.
- An embodiment of the present invention provides a data object processing apparatus, including: a partition dividing module, configured to divide a data object into one or more partitions; a data segment generating module, configured to calculate a sampling compression ratio of each partition, aggregate consecutive partitions whose sampling compression ratios share a common feature into data segments, and obtain a sampling compression ratio of each data segment; and a data block generating module, configured to select, according to the length interval to which the length of each data segment belongs and the compression-ratio interval to which the sampling compression ratio of each data segment belongs, an expected length with which to split the data segment into data blocks, wherein the sampling compression ratio of each data segment uniquely belongs to one compression-ratio interval and the length of each data segment uniquely belongs to one length interval.
- The data object is first divided into partitions, the partitions are then aggregated into data segments according to the compression ratio of each partition, and the data segments are then split into data blocks, thereby achieving the effect of splitting the data object into data blocks.
- FIG. 1 is a flow chart of an embodiment of a data object processing method
- FIG. 2 is a flow chart of an embodiment of a data object processing method
- FIG. 3 is a flow chart of an embodiment of a data object processing method
- FIG. 4 is a flow chart of an embodiment of a data object processing method
- FIG. 5 is a flow chart of an embodiment of a data object processing method
- Figure 6 is a structural diagram of an embodiment of a data object processing apparatus. Detailed description
- the prior art proposes to adopt fine-grained partitioning in the transition area between the repeated content and the non-repetitive content, and coarse-grained blocking in other areas, thereby forming a bimodal blocking method.
- this method needs to frequently query the repeatability of candidate blocks in the blocking process, thereby causing query load pressure on the deduplication storage system.
- the block result of this method depends on the transmission order of the multi-version data and is not stable.
- The embodiment of the invention provides a multi-peak data chunking method based on content and compressibility, which can split a data object into data blocks.
- The data object is first divided into partitions; the partitions are then aggregated into data segments according to their sampling compression ratios; finally, an expected length is selected for each data segment according to its length and sampling compression ratio, and each data segment is split into data blocks, achieving the effect of splitting the data object into a sequence of data blocks with a multi-peak length distribution.
- a data object refers to a piece of data that can be operated, such as a file or a data stream.
- the partitioning strategy mapping table maintains the mapping relationship between the compression rate interval, the length interval, and the expected segment length. The higher the compression ratio of the data segment and the longer the length, the larger the corresponding expected segment length.
- The method of the embodiment of the present invention splits a data object into data blocks, and the resulting blocks can serve as the unit of data deduplication, reducing space occupation or bandwidth consumption when storing or transmitting the data object.
- the data blocks generated by the split can also be used for purposes other than data deduplication.
- Embodiments of the present invention include: step (a), a data object is input to the chunking device; the data object, such as a file or a data stream, may come from a block device or from memory inside the chunking device, which is not limited in this embodiment; step (b), the data object is divided into one or more partitions, the compression ratio of each partition is estimated by sampling, the chunking policy mapping table is queried, and consecutive partitions with a common compression-rate feature are aggregated into data segments; step (c), for each data segment, the chunking policy mapping table is queried according to the segment's compression-ratio interval and length interval to select an expected length, and a content-based chunking method divides the data segment into a sequence of chunks with the selected expected length; and step (d), adjacent chunks located in different data segments are spliced, and the hash value of each chunk is calculated for data deduplication; the splicing step is optional, and the hash values may be calculated directly without splicing.
- There is more than one way for step (b) to use the sampling compression-rate information of the partition sequence to re-divide the data object into a sequence of data segments.
- For example, instead of referring to compression-ratio intervals, step (b) may aggregate adjacent consecutive partitions whose sampling compression ratios differ by less than a specified threshold into one data segment.
- Step 11: Divide the data object to be deduplicated into one or more partitions, calculate the sampling compression ratio of each partition, aggregate adjacent consecutive partitions having a common compression-rate feature into data segments, and obtain the sampling compression ratio of each data segment.
- Each partition may be of fixed or variable length. In the fixed-length case, a length may be selected within a certain range as the partition length. Alternatively, the data object may be scanned to output multiple sets of candidate block boundaries with different expected lengths, and one set of candidate block boundaries may be used to divide the data object into one or more variable-length partitions.
- One method of estimating the compression ratio: for each partition, extract a piece of sample data according to a sampling ratio S, compress the sample data with a lossless compression algorithm such as an LZ algorithm or RLE coding, calculate the compression ratio of the sample, and use it as the sampling compression ratio of the partition. The higher the sampling ratio, the closer the sample's compression ratio is to the true compression ratio of the partition.
- The compression ratio is generally less than 1; for incompressible data, metadata overhead such as description fields may make the compressed encoding longer than the original data, so compression ratios greater than 1 can occur.
- The compression-rate entry contains a series of compression-ratio intervals. The intersection of different compression-ratio intervals is empty, and the union of all compression-ratio intervals constitutes the complete compression-ratio range [0, ∞).
- The compression-rate feature is the criterion by which partitions are aggregated: partitions whose compression ratios satisfy a preset condition are said to conform to the compression-rate feature.
- The compression-rate feature may be a compression-ratio interval, that is, a range of compression ratios; or it may be a threshold on the difference between the compression ratios of adjacent partitions. The example of Table 1 uses compression-ratio intervals as the compression-rate feature.
- The data object is divided into seven partitions, partition 1 through partition 7. Sampling each partition and estimating its compression ratio yields the sampling compression ratio of each partition.
- The sampling compression ratio of partition 1 is 0.4, and that of partition 2 is 0.42.
- Each sampling compression ratio belongs to exactly one compression-ratio interval: for example, a ratio of 0.4 belongs to the interval [0, 0.5), while a ratio of 0.61 belongs to the interval [0.5, 0.8).
- Since each partition has a sampling compression ratio, each partition can be regarded as belonging to a compression-ratio interval. Consecutive partitions belonging to the same compression-ratio interval are aggregated into one data segment: partition 1 and partition 2 are aggregated into data segment 1; partitions 3, 4, and 5 into data segment 2; and partitions 6 and 7 into data segment 3.
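The aggregation described above can be sketched as follows; the interval edges 0.5 and 0.8 follow the example intervals in the text, while the ratios for partitions 3 through 7 are hypothetical values chosen to reproduce the grouping in the example:

```python
import bisect

# Compression-ratio interval edges from the example:
# interval 0 = [0, 0.5), interval 1 = [0.5, 0.8), interval 2 = [0.8, inf)
EDGES = [0.5, 0.8]

def interval_index(ratio: float) -> int:
    """Index of the compression-ratio interval that `ratio` falls into."""
    return bisect.bisect_right(EDGES, ratio)

def aggregate_by_interval(ratios):
    """Aggregate consecutive partitions whose sampling compression
    ratios fall into the same interval into one data segment; returns
    a list of (interval_index, [partition indices]) pairs."""
    segments = []
    for i, r in enumerate(ratios):
        idx = interval_index(r)
        if segments and segments[-1][0] == idx:
            segments[-1][1].append(i)      # same interval: extend segment
        else:
            segments.append((idx, [i]))    # interval change: new segment
    return segments
```

With zero-based partition indices, the example's partitions 1-2, 3-5, and 6-7 come out as segments [0, 1], [2, 3, 4], and [5, 6].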
- Step 12: According to the length interval to which the length of each data segment belongs and the compression-ratio interval to which its sampling compression ratio belongs, select an expected length with which to split the data segment into data blocks.
- The length range of data segments is divided into at least one length interval; the length of each data segment uniquely belongs to one length interval, and the sampling compression ratio of each data segment uniquely belongs to one compression-ratio interval.
- The sampling compression ratio of a data segment can be obtained by sampling and compressing the data segment itself, or by averaging the sampling compression ratios of the partitions that constitute it.
- The average may be an arithmetic mean, or a weighted average in which the sampling compression ratio of each constituent partition is the value and the length of that partition is the weight.
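The length-weighted variant can be sketched as:

```python
def segment_sample_ratio(partition_ratios, partition_lengths):
    """Length-weighted average of the constituent partitions' sampling
    compression ratios, used as the data segment's sampling ratio.
    With equal lengths this reduces to the arithmetic mean."""
    total = sum(partition_lengths)
    return sum(r * n for r, n in zip(partition_ratios, partition_lengths)) / total
```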
- In this step, block boundaries with the selected expected length may be chosen from the candidate block boundaries obtained by scanning in step 11, splitting the data segment into data blocks without scanning it again.
- Alternatively, this step may select an expected length and then scan the data segment to find the block boundaries, splitting it into data blocks; compared with pre-computing multiple sets of candidate block boundaries with different expected lengths, this adds one scan of the data segment to find the boundaries.
- The range of compression ratios is divided into the intervals [0, 0.5), [0.5, 0.8), and [0.8, ∞); the sampling compression ratio of each data segment belongs to one compression-ratio interval, which can also be understood as each data segment corresponding to one compression-ratio interval.
- the length range of the data segment is divided into at least one length interval, and there is no intersection between the length intervals, so each data segment corresponds to one length interval, and the length interval and the compression rate interval jointly determine the expected length of one data segment.
- For example, if the length of data segment A belongs to the length interval [0 MB, 10 MB) and its sampling compression ratio falls in [0, 0.5), the pair (length interval [0 MB, 10 MB), compression-ratio interval [0, 0.5)) determines the expected length, here 32 KB.
- The individual data blocks may differ in length, but among the data blocks formed by splitting data segment A, the average block length is close to the expected length of 32 KB: a variable-length block is most likely to be about 32 KB long, and 32 KB is also a peak length.
- Each expected length corresponds to one peak length; the embodiment of the present invention is therefore a data chunking and deduplication method with multiple peak lengths.
- The expected length can be obtained from empirical values or from statistical analysis.
- An optional rule for determining the expected length is shown in the example of Table 2: for the same compression-ratio interval, the corresponding expected length increases as the lower limit of the length interval increases; conversely, for the same length interval, the expected length in the mapping table increases as the lower limit of the compression-ratio interval increases.
- In other words, the expected block length of a data segment is positively correlated with the lower limit of the compression-ratio interval to which its sampling compression ratio belongs and with the lower limit of the length interval to which its length belongs.
- The rationale is that the deduplication ratio of large data objects and hard-to-compress data objects is not sensitive to block granularity; with this rule for choosing the expected length, the number of blocks can be reduced quickly without causing a rapid drop in the deduplication ratio.
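A sketch of such a mapping-table lookup; every boundary and cell value below is hypothetical except the 32 KB cell for (length in [0, 10) MB, ratio in [0, 0.5)), which matches the data segment A example in the text:

```python
import bisect

# Hypothetical Table 2: rows are length intervals (lower bounds in MB),
# columns are compression-ratio intervals; cells are expected block
# lengths in KB. The expected length grows with both lower bounds,
# following the rule described in the text.
LEN_EDGES_MB = [10, 100]       # [0,10), [10,100), [100,inf)
RATIO_EDGES = [0.5, 0.8]       # [0,0.5), [0.5,0.8), [0.8,inf)
EXPECTED_KB = [
    [32, 48, 64],              # length in [0, 10) MB
    [48, 64, 96],              # length in [10, 100) MB
    [64, 96, 128],             # length >= 100 MB
]

def expected_length_kb(seg_len_mb: float, seg_ratio: float) -> int:
    """Look up the expected block length for a data segment from its
    length interval and compression-ratio interval."""
    row = bisect.bisect_right(LEN_EDGES_MB, seg_len_mb)
    col = bisect.bisect_right(RATIO_EDGES, seg_ratio)
    return EXPECTED_KB[row][col]
```

Because each segment's length and ratio each fall into exactly one half-open interval, the lookup is unambiguous: every segment maps to exactly one expected length.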
- The data blocks formed by splitting each data segment have a before-and-after order. These ordered data blocks form a data block sequence, which may also be called a block subsequence.
- Step 12 may further include splicing the block subsequences generated by splitting the different data segments to form the block sequence of the whole data object, in which the order of the blocks is the same as the order of the block data within the data object.
- Step 13: In each pair of adjacent data segments, splice the last data block of the preceding data segment and the first data block of the following data segment into one data block; the block formed by this splicing may be called a spliced data block.
- Step 13 is optional. For example, if the data segment boundaries in step 11 come from fixed-length partition boundaries, step 13 may be applied; after splicing, fixed-length partition boundaries are no longer used as block boundaries, which yields a better deduplication effect.
- A spliced data block is generated in the transition region between two data segments, and its neighboring data blocks correspond to different expected lengths.
- The spliced data block may be split into finer-grained data blocks using the smaller of the two expected lengths, which better exposes duplicate content in the transition region between the two data segments and improves the deduplication effect while adding few data blocks overall.
- Alternatively, the spliced data block may be split into finer-grained data blocks using an expected length smaller than the smaller of the two expected lengths.
- step 13 may be performed prior to splicing the block subsequence or after splicing the block subsequence.
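The boundary-eliminating splice of step 13 can be sketched as follows, with blocks represented as byte strings for illustration:

```python
def splice_segments(block_subsequences):
    """Combine per-segment block subsequences into one block sequence,
    merging the last block of each segment with the first block of the
    next so that segment boundaries do not survive as block boundaries."""
    merged = []
    for sub in block_subsequences:
        sub = list(sub)
        if merged and sub:
            merged[-1] = merged[-1] + sub.pop(0)   # merge across the seam
        merged.extend(sub)
    return merged
```

The concatenation of the output blocks equals the concatenation of the input blocks; only the boundaries at the segment seams disappear.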
- Step 14: Calculate the fingerprint of each data block, use the fingerprint to determine whether the block is already stored in the storage device, and send each block not yet stored, together with its fingerprint and sampling compression ratio, to the storage device.
- The fingerprint uniquely identifies a data block; each data block has a corresponding fingerprint.
- One method of fingerprint calculation is to calculate a hash value of a data block as a fingerprint.
- Step 15: The storage device stores the received data blocks and their fingerprints. During storage, it may check whether the sampling compression ratio of a received block meets a compression-rate threshold: blocks that meet the threshold are compressed before being stored, saving storage space, while blocks that do not are stored directly without compression. For example, with a threshold of 0.7, blocks with a sampling compression ratio of 0.7 or less are compressed and stored, and blocks with a ratio greater than 0.7 are stored uncompressed.
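Steps 14 and 15 can be sketched together as follows, assuming SHA-256 as the fingerprint hash, zlib as the compressor, and an in-memory dict standing in for the storage device; the 0.7 threshold follows the text's example:

```python
import hashlib
import zlib

def store_block(store: dict, block: bytes, sample_ratio: float,
                compress_threshold: float = 0.7):
    """Deduplicate a block by its SHA-256 fingerprint; compress only
    blocks whose sampling compression ratio meets the threshold.
    Returns (fingerprint, stored_new)."""
    fp = hashlib.sha256(block).hexdigest()
    if fp in store:
        return fp, False                       # duplicate: nothing sent
    if sample_ratio <= compress_threshold:
        store[fp] = ("zlib", zlib.compress(block))   # compressible: pack it
    else:
        store[fp] = ("raw", block)             # hard to compress: store as-is
    return fp, True

def load_block(store: dict, fp: str) -> bytes:
    codec, payload = store[fp]
    return zlib.decompress(payload) if codec == "zlib" else payload
```

Skipping compression for blocks above the threshold avoids spending CPU (and, per the text, possibly extra space) on data that will not compress.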
- There is more than one segmentation method for step 11.
- The segmentation strategy of step 11 can be modified to: aggregate adjacent consecutive partitions whose sampling compression ratios differ by less than a specified threshold into the same data segment. In this case, the threshold on the difference between sampling compression ratios is the compression-rate feature. Continuing the example of Table 1, let 0.1 be the threshold. Then 0.42 - 0.4 < 0.1, so partition 1 and partition 2 fall into the same data segment; and 0.53 - 0.42 > 0.1, so partition 3 does not fall into the same data segment as partition 2.
- partition 1 and partition 2 are divided into the first data segment
- partition 3 and partition 4 are divided into the second data segment
- partition 5, partition 6, and partition 7 are divided into the third data segment.
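A sketch of this threshold-based variant; the ratios for partitions 3 through 7 are hypothetical values chosen so the grouping matches the example (zero-indexed, the text's partitions 1-2, 3-4, and 5-7):

```python
def aggregate_by_threshold(ratios, threshold: float = 0.1):
    """Aggregate adjacent partitions into the same data segment while
    the difference between neighbouring sampling compression ratios
    stays below `threshold`; returns lists of partition indices."""
    if not ratios:
        return []
    segments = [[0]]
    for i in range(1, len(ratios)):
        if abs(ratios[i] - ratios[i - 1]) < threshold:
            segments[-1].append(i)     # similar compressibility: extend
        else:
            segments.append([i])       # jump in ratio: start new segment
    return segments
```

Unlike the interval-based feature, this variant has no fixed interval edges, so two partitions with ratios 0.49 and 0.51 stay together rather than being split across an interval boundary.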
- The second embodiment is described in detail with reference to FIG. 2.
- the data object processing method in this embodiment includes the following steps.
- The chunking policy mapping table, illustrated by Table 1 and Table 2, records which compression ratios share the same compression-rate feature; it also records the length interval to which a data segment's length belongs, the compression-ratio interval to which its compression ratio belongs, and the expected length jointly determined by the two. The expected length is used to split the data segment into data blocks. This step may also be performed later, at any point before the chunking policy mapping table is needed.
- the data object to be processed is obtained, and the source of the acquisition may be the device that executes the method, or the external device connected to the device that performs the method.
- this step needs to scan the data object to obtain the boundary of each fixed length partition, and the data between the adjacent two boundaries is a partition.
- the consecutive partitions having the same compression rate feature are aggregated into data segments.
- Each data segment is split into data blocks with its specific expected length.
- this step needs to scan each data segment, find the boundary of each data block, and split the data segment with the boundary to form a series of data blocks.
- After step 26 is executed, processing of the data object is complete, and the effect of splitting the data object into data blocks is achieved. Subsequent steps 27, 28, and 29 further extend the data object processing method.
- One splicing method: for any two adjacent data segments, splice the last data block of the preceding data segment and the first data block of the following data segment into one data block, thereby eliminating the data segment boundary; the block subsequences are then combined to form the data block sequence of the entire data object.
- the data block spliced in this step can be split into smaller-sized data blocks to improve the effect of deduplication.
- Another splicing method is: each data block remains unchanged, and the block subsequences of the data segments are sorted according to the order of the data segments in the data object, so that the first data block of the next data segment directly follows the last data block of the previous data segment.
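The two splicing methods described above can be sketched as follows; the block contents are arbitrary and the function names are my own.

```python
def splice_merge(subsequences: list[list[bytes]]) -> list[bytes]:
    """First splicing method: for each pair of adjacent data segments, merge
    the last block of the previous segment with the first block of the next,
    eliminating the segment boundary between them."""
    result: list[bytes] = []
    for blocks in subsequences:
        if result and blocks:
            result[-1] = result[-1] + blocks[0]  # merge across the segment boundary
            result.extend(blocks[1:])
        else:
            result.extend(blocks)
    return result

def splice_concat(subsequences: list[list[bytes]]) -> list[bytes]:
    """Second splicing method: every block stays unchanged; the subsequences
    are simply concatenated in the order of the data segments."""
    return [b for blocks in subsequences for b in blocks]
```

Note that `splice_merge` can produce an oversized merged block, which is why the text suggests optionally re-splitting it into smaller blocks afterwards.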
- the block compression ratio of the block can be obtained by using the sample compression ratio of the segment to which it belongs.
- One method is to directly use the segmented sample compression rate as the sample compression ratio of each block contained therein.
- A specific implementation of the above steps is: scan the data object with a sliding window of w bytes, using a fast hash algorithm to recompute the fingerprint f(w-bytes) of the data in the window each time the window slides forward by one byte; if the current window position satisfies the match condition, a data block boundary is set there.
- The function Match maps f(w-bytes) onto an interval; since the hash function is random, the filter condition outputs a series of data blocks whose length peaks at the expected length E.
- a block subsequence with a new peak length is output.
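A minimal sketch of the sliding-window blocking described above, under assumptions of my own: `zlib.adler32` stands in for the fast window fingerprint, and the Match filter is taken to be `f mod E == 0`; the patent does not specify either choice.

```python
import zlib

W = 48  # assumed sliding-window width in bytes (the "w-bytes" window)

def chunk_boundaries(data: bytes, expected_len: int, w: int = W) -> list[int]:
    """Scan with a w-byte window; declare a boundary wherever the window
    fingerprint satisfies the filter condition f mod expected_len == 0.
    Because the fingerprint behaves randomly, block lengths peak near
    expected_len."""
    bounds: list[int] = []
    for i in range(w, len(data) + 1):
        f = zlib.adler32(data[i - w:i])   # fast fingerprint of the current window
        if f % expected_len == 0:         # the assumed Match filter condition
            bounds.append(i)
    if not bounds or bounds[-1] != len(data):
        bounds.append(len(data))          # the end of the object is always a boundary
    return bounds
```

A production implementation would use a rolling hash so each one-byte slide is O(1) instead of rehashing the whole window.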
- The front boundary of such a data block is determined by the expected length of the previous data segment, and its rear boundary by the expected length of the latter data segment, so the two boundaries of the data block lie in two different data segments. The splicing of the block subsequences of adjacent segments is thus completed automatically, and the process does not introduce segment boundaries as block boundaries.
- The second embodiment further provides flowchart 3. The steps are: 3a, input a data object on which the data object processing method is to be executed; the data object may be temporarily stored in a cache before being processed. 3b, partition the data object, sample the partitions, and estimate their compression ratios; L, M, and H represent three different compression rate intervals. 3c, aggregate consecutive partitions with a common compression rate feature into data segments. 3d, select the expected length according to the compression ratio and length feature of each data segment and calculate the block boundaries; the data segments are divided into data blocks according to the calculated block boundaries, and the data blocks of each data segment form a block subsequence. 3e, splice the block subsequences of the data segments and calculate the block fingerprints.
- the data object to be deduplicated is divided into a plurality of fixed length partitions.
- The data object to be deduplicated is pre-blocked with several expected lengths. Specifically, a content-based blocking method is used to scan the data object (file or data stream) and generate multiple sets of candidate block boundaries with different expected lengths, where each set of candidate block boundaries corresponds to a candidate block sequence of the data object; a candidate block sequence can also be understood as a blocking scheme.
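The pre-blocking step can be sketched as a single scan that emits one candidate boundary set per expected length. The window fingerprint, the filter `f mod E == 0`, and the particular expected lengths are all assumptions for illustration; with nested power-of-two divisors, as here, every boundary for a larger expected length is also a boundary for each smaller one.

```python
import zlib

W = 48                               # assumed window width
EXPECTED_LENS = (2048, 4096, 8192)   # assumed set of expected lengths

def candidate_boundary_sets(data: bytes,
                            expected_lens=EXPECTED_LENS,
                            w: int = W) -> dict[int, list[int]]:
    """One scan of the data object producing one set of candidate block
    boundaries per expected length."""
    sets: dict[int, list[int]] = {e: [] for e in expected_lens}
    for i in range(w, len(data) + 1):
        f = zlib.adler32(data[i - w:i])
        for e in expected_lens:
            if f % e == 0:           # assumed filter condition per expected length
                sets[e].append(i)
    for e, bounds in sets.items():
        if not bounds or bounds[-1] != len(data):
            bounds.append(len(data))
    return sets
```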
- The data object is divided into partitions by using one of the pre-blocking schemes; the partitions are used for sample compression and for determining how the data segments are divided.
- The candidate block boundaries corresponding to the selected expected length may then be used directly to split each data segment into data blocks.
- The second embodiment needs to scan the data object both when dividing it into partitions and when splitting the data segments into data blocks; in the third embodiment the data object needs to be scanned only once, which saves system resources and improves the processing efficiency.
- Because the partition boundaries and data block boundaries in the third embodiment are both based on the candidate block boundaries, the segment boundaries generated by partition aggregation do not adversely affect deduplication, and the operation of removing segment boundaries between data blocks is not required. That is, it is not necessary to splice the last data block of the previous data segment and the first data block of the latter data segment in adjacent data segments into one data block.
- the data object processing method in this embodiment may specifically include the following steps.
- The blocking policy mapping table (see Table 1 and Table 2) records which compression ratios belong to the same compression rate feature; in addition, it records the length interval to which a data segment belongs, the compression ratio interval to which the data segment's compression ratio belongs, and the expected length determined by the two.
- The data object to be processed is obtained; the source may be the device that executes the method, or an external device connected to that device.
- each set of candidate block boundaries corresponds to a desired length.
- These desired lengths include the desired length for each data segment in subsequent steps.
- Step 44: from the multiple sets of candidate block boundaries in step 43, select one set of candidate boundaries, and sample-compress the candidate block sequence formed by that set of boundaries.
- That is, the data object is partitioned, and the expected length of the partitions is one of the multiple expected lengths in step 43, so the candidate block boundaries corresponding to that expected length can be used directly to divide the data object into partitions. Then each partition is sample-compressed, and the compression ratio obtained by the sample compression is used as the sampling compression ratio of the entire partition.
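Sample compression can be sketched as follows: compress only a prefix sample of the partition and take the resulting ratio as the sampling compression ratio of the whole partition. The sample size and the use of `zlib` are assumptions; the patent does not name a compressor.

```python
import zlib

SAMPLE_BYTES = 1024   # assumed sample size per partition

def sampling_compression_ratio(partition: bytes,
                               sample_bytes: int = SAMPLE_BYTES) -> float:
    """Compress a sample of the partition and use the resulting ratio
    (compressed size / original size, smaller means more compressible)
    as the sampling compression ratio of the entire partition."""
    sample = partition[:sample_bytes]
    if not sample:
        return 1.0
    return len(zlib.compress(sample)) / len(sample)
```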
- the consecutive partitions with the same compression rate feature are aggregated into data segments.
- This step splits the data segments into data blocks.
- Because the candidate block boundaries used in step 46 come from step 43, the data object does not need to be scanned again when the data segments are split into data blocks, which saves system resources.
- After step 46 is executed, the processing of the data object is complete, and the data object has been split into data blocks.
- Subsequent steps 47, 48, 49 are further extensions to the data object processing method.
- 47: splice the block subsequences of adjacent data segments. The block subsequences of the data segments are sorted according to the order of the data segments in the data object to form the data block sequence of the data object. A data block can also be called a data chunk.
- the block compression ratio of the block can be obtained by using the sample compression ratio of the segment to which it belongs.
- One method is to directly use the segmented sample compression rate as the sample compression ratio of each block contained therein.
- Embodiment 3 additionally provides Flowchart 5.
- The flow of the steps in FIG. 5 is: 5a, input a data object on which the data object processing method is to be executed; the data object may be temporarily stored in a cache before being processed. 5b, determine multiple sets of candidate block boundaries with different expected lengths. 5c, among the multiple sets of candidate block boundaries determined in 5b, select one candidate block sequence to divide the data object, then sample the resulting candidate blocks and estimate their compression rates. 5d, aggregate consecutive candidate blocks with a common compression rate feature into data segments. 5e, select the expected length and the corresponding block boundaries according to the compression ratio and length characteristics of each data segment; the data blocks into which each data segment is split according to the corresponding boundaries form a block subsequence. 5f, splice the block subsequences of the data segments and calculate the block fingerprints.
- This embodiment describes a data object processing apparatus 6, to which the methods of the first, second, and third embodiments can be applied.
- the data object processing apparatus 6 includes: a partition division module 61, a data segment generation module 62, and a data block generation module 63.
- The partition dividing module 61 is configured to divide the data object into one or more partitions. The data segment generating module 62 is configured to calculate the sampling compression ratio of each partition, aggregate consecutive partitions whose sampling compression ratios share a common feature into one data segment, and obtain the sampling compression ratio of each data segment.
- The data block generating module 63 is configured to select, according to the length interval to which the length of each data segment belongs and the compression ratio interval to which its compression ratio belongs, an expected length to split the data segment into data blocks, wherein the compression ratio of each data segment belongs to exactly one compression ratio interval, and the length of each data segment belongs to exactly one length interval.
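The selection rule of the data block generating module can be sketched with an illustrative blocking policy mapping table. The interval edges and the desired lengths below are my own placeholders, not the contents of the real Tables 1 and 2; the point is only that each (length interval, compression ratio interval) pair determines exactly one expected length.

```python
LENGTH_EDGES = (65536,)      # assumed: segments shorter than 64 KiB vs. longer
RATIO_EDGES = (0.5, 0.8)     # assumed compression ratio interval edges

# Assumed policy table: (length interval index, ratio interval index) -> expected length
POLICY = {
    (0, 0): 4096, (0, 1): 8192, (0, 2): 16384,
    (1, 0): 8192, (1, 1): 16384, (1, 2): 32768,
}

def interval_index(value: float, edges: tuple) -> int:
    """Return the index of the unique interval the value falls into."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def desired_length(segment_len: int, segment_ratio: float) -> int:
    """A segment's length belongs to exactly one length interval and its
    compression ratio to exactly one ratio interval; together they select
    the expected block length."""
    return POLICY[(interval_index(segment_len, LENGTH_EDGES),
                   interval_index(segment_ratio, RATIO_EDGES))]
```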
- The partition dividing module 61 may divide the data object into one or more fixed-length partitions; or it may be used to calculate multiple sets of candidate block boundaries with different expected lengths and use one set of candidate block boundaries to divide the data object into one or more variable-length partitions.
- The data segment generating module 62 may be specifically configured to: calculate the sampling compression ratio of each partition, aggregate adjacent consecutive partitions whose sampling compression ratios fall into the same compression rate interval into one data segment, and obtain the sampling compression ratio of each data segment.
- The data segment generating module 62 may alternatively be configured to: calculate the sampling compression ratio of each partition, aggregate adjacent consecutive partitions whose sampling compression ratios differ by less than a specified threshold into one data segment, and obtain the sampling compression ratio of each data segment.
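The two aggregation variants of the data segment generating module can be sketched side by side. The interval edges and the threshold are illustrative assumptions; both functions take the per-partition sampling ratios and return segments as lists of partition indices.

```python
def aggregate_by_interval(ratios, edges=(0.5, 0.8)):
    """First variant: adjacent partitions whose sampling compression ratios
    fall into the same compression rate interval join one data segment."""
    def idx(r):
        return sum(r >= e for e in edges)  # index of the interval r falls into
    segments = []
    for i, r in enumerate(ratios):
        if segments and idx(ratios[segments[-1][-1]]) == idx(r):
            segments[-1].append(i)
        else:
            segments.append([i])
    return segments

def aggregate_by_threshold(ratios, threshold=0.1):
    """Second variant: adjacent partitions whose sampling compression ratios
    differ by no more than the threshold join one data segment."""
    segments = []
    for i, r in enumerate(ratios):
        if segments and abs(ratios[segments[-1][-1]] - r) <= threshold:
            segments[-1].append(i)
        else:
            segments.append([i])
    return segments
```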
- The data block generation module 63's function of selecting an expected length to split the data segments into data blocks may specifically be: according to the selected expected length, choose, from the multiple sets of candidate block boundaries with different expected lengths calculated by the partition dividing module, the block boundaries having that expected length, and use them to split the data segments into data blocks.
- The data block generating module 63 can also be used to splice the last data block of the previous data segment and the first data block of the latter data segment in adjacent data segments into one spliced data block.
- The data block generating module 63 may be further configured to split the spliced data block into a plurality of data blocks, where the expected length used for the split is less than or equal to the expected length corresponding to the previous data segment and also less than or equal to the expected length corresponding to the latter data segment.
- The data object processing apparatus 6 may further include a data block sending module 64.
- The data block sending module 64 is configured to calculate the fingerprint of each data block, use the fingerprint to determine whether each data block is already stored in the storage device, and send the data blocks that are not yet stored, together with their fingerprints and sampling compression rates, to the storage device. The storage device stores the received data block fingerprints, determines whether the sampling compression rate of each received data block meets the compression rate threshold, and compresses the data blocks that meet the threshold before storing them.
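The sending module's behavior can be sketched as follows. The SHA-256 fingerprint, the `zlib` compressor, the threshold value, and the toy `StorageDevice` class are all assumptions for illustration; only the overall flow (fingerprint, deduplicate, send with ratio, compress if the threshold is met) follows the text.

```python
import hashlib
import zlib

COMPRESS_THRESHOLD = 0.9   # assumed: compress only blocks whose sampling ratio is below this

class StorageDevice:
    """Toy stand-in for the storage device: keeps fingerprints and payloads."""
    def __init__(self):
        self.stored = {}

    def has(self, fp: str) -> bool:
        return fp in self.stored

    def put(self, fp: str, block: bytes, ratio: float) -> None:
        # The storage device applies the compression rate threshold itself.
        payload = zlib.compress(block) if ratio < COMPRESS_THRESHOLD else block
        self.stored[fp] = payload

def send_blocks(blocks, ratios, device):
    """Fingerprint each block, skip blocks the device already holds, and
    send the rest together with fingerprint and sampling compression rate.
    Returns how many blocks were actually sent."""
    sent = 0
    for block, ratio in zip(blocks, ratios):
        fp = hashlib.sha256(block).hexdigest()
        if not device.has(fp):
            device.put(fp, block, ratio)
            sent += 1
    return sent
```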
- The data object processing apparatus 6 can also be regarded as a device composed of a CPU and a memory. A program is stored in the memory, and the CPU executes the methods of the first, second, or third embodiment through the program in the memory. The apparatus may also include an interface for connecting to a storage device; the interface may, for example, send the data blocks generated by the CPU to the storage device.
- The present invention can be implemented by software plus the necessary general-purpose hardware, and of course also by hardware alone, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part of it that contributes to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, hard disk, or optical disc of a computer, which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform the methods described in the embodiments of the present invention.
Priority Applications (9)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020157021462A KR101653692B1 (ko) | 2013-08-19 | 2013-08-19 | 데이터 오브젝트 처리 방법 및 장치 |
RU2015139685A RU2626334C2 (ru) | 2013-08-19 | 2013-08-19 | Способ и устройство обработки объекта данных |
CA2898667A CA2898667C (en) | 2013-08-19 | 2013-08-19 | Data object processing method and apparatus |
PCT/CN2013/081757 WO2015024160A1 (zh) | 2013-08-19 | 2013-08-19 | 一种数据对象处理方法与装置 |
EP13892074.9A EP2940598B1 (en) | 2013-08-19 | 2013-08-19 | Data object processing method and device |
CN201380003213.3A CN105051724B (zh) | 2013-08-19 | 2013-08-19 | 一种数据对象处理方法与装置 |
JP2015561907A JP6110517B2 (ja) | 2013-08-19 | 2013-08-19 | データオブジェクト処理方法及び装置 |
BR112015023973-0A BR112015023973B1 (pt) | 2013-08-19 | Método e aparelho de processamento de objeto de dados | |
US14/801,421 US10359939B2 (en) | 2013-08-19 | 2015-07-16 | Data object processing method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015024160A1 true WO2015024160A1 (zh) | 2015-02-26 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112306974A (zh) * | 2019-07-30 | 2021-02-02 | 深信服科技股份有限公司 | 一种数据处理方法、装置、设备及存储介质 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080133561A1 (en) * | 2006-12-01 | 2008-06-05 | Nec Laboratories America, Inc. | Methods and systems for quick and efficient data management and/or processing |
CN101706825A (zh) * | 2009-12-10 | 2010-05-12 | 华中科技大学 | 一种基于文件内容类型的重复数据删除方法 |
Also Published As
Publication number | Publication date |
---|---|
BR112015023973A2 (pt) | 2017-07-18 |
JP2016515250A (ja) | 2016-05-26 |
KR20150104623A (ko) | 2015-09-15 |
RU2626334C2 (ru) | 2017-07-26 |
EP2940598A1 (en) | 2015-11-04 |
EP2940598B1 (en) | 2019-12-04 |
CA2898667A1 (en) | 2015-02-26 |
CA2898667C (en) | 2019-01-15 |
CN105051724A (zh) | 2015-11-11 |
CN105051724B (zh) | 2018-09-28 |
EP2940598A4 (en) | 2016-06-01 |
RU2015139685A (ru) | 2017-03-22 |
US20170017407A1 (en) | 2017-01-19 |
US10359939B2 (en) | 2019-07-23 |
KR101653692B1 (ko) | 2016-09-02 |
JP6110517B2 (ja) | 2017-04-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201380003213.3 Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13892074 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2898667 Country of ref document: CA |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2013892074 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 20157021462 Country of ref document: KR Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 2015561907 Country of ref document: JP Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 2015139685 Country of ref document: RU Kind code of ref document: A |
|
REG | Reference to national code |
Ref country code: BR Ref legal event code: B01A Ref document number: 112015023973 Country of ref document: BR |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 112015023973 Country of ref document: BR Kind code of ref document: A2 Effective date: 20150917 |