CN102082575A

CN102082575A - Method for removing repeated data based on pre-blocking and sliding window

Info

Publication number: CN102082575A
Application number: CN2010105858665A
Authority: CN
Inventors: 秦志光; 王亦德; 匡平; 高嵘
Original assignee: JIANGSU GOWOO INFORMATION TECHNOLOGY Co Ltd
Current assignee: JIANGSU GOWOO INFORMATION TECHNOLOGY Co Ltd
Priority date: 2010-12-14
Filing date: 2010-12-14
Publication date: 2011-06-01

Abstract

The invention relates to a method for removing repeated data based on pre-blocking and a sliding window. A data object DO is first pre-blocked into small blocks MC that do not overlap one another. Then, taking the small block MC as the unit, a sliding window is used to detect consecutive new small blocks MC and merge them into a large block SC, while small blocks MC are retained where new and old data meet. Because different blocking strategies are applied in this way, a higher compression ratio can be obtained and metadata overhead can be reduced even when the expected block size is large.

Description

Duplicate data elimination method based on pre-chunking and sliding window

Technical field

The present invention relates to the application of duplicate data elimination, and in particular to a duplicate data elimination method based on pre-chunking (pre-blocking) and a sliding window.

Background technology

To reduce the storage space that data occupies and the bandwidth consumed when data is transmitted remotely, data usually needs to be compressed. Traditional data compression methods exploit the information redundancy within a data object (Data Object, hereinafter DO) itself and compress it by internal encoding. Such methods do not consider the relationship between the DO being compressed and other DOs in the system (for example, old versions of that DO already stored in the system), so the compression ratio obtained (Compression Ratio, CR = original DO size / compressed DO size) is usually limited, about 2:1 on average. CR is also strongly affected by the encoding format of the data object; for binary files and audio/video files, for example, traditional compression methods have very limited effect.

In many application environments, different DOs in a system share a large amount of identical data: for example, different backup versions of the same file in a backup system, different release versions of the same software, or mass-mailed messages in a mail system. Duplicate data elimination exploits this redundancy between DOs to compress data; it can achieve a CR far higher than traditional compression methods and is little affected by the data encoding format. Duplicate data elimination uses methods such as fixed-size chunking (Fixed Size Chunking, FSC), content-defined variable-size chunking (Content Defined Chunking, CDC) and sliding-window chunking (Sliding Window Chunking, SWC) to divide a DO into contiguous, non-overlapping chunks (Chunk). It computes the hash value of the data in each Chunk, uses it as the unique identifier of that Chunk (ChunkID), and stores it in a hash table (HT_ID). DOs are stored in units of Chunks. When a new DO is written, the ChunkID of each Chunk making up the DO is matched against the records in HT_ID (this is called a chunk existence query, Chunk Existence Query, CEQ). Chunks already stored in the system (i.e. those whose ChunkID has a match in HT_ID) are not stored again; only new Chunks and their ChunkIDs are stored. Because DOs may share a great deal of data, as noted above, this method can greatly reduce the physical storage volume. And because duplicate data does not need to be written again, it can also greatly reduce network traffic when DOs are transmitted remotely.
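The write path described above (chunk the DO, compute each ChunkID, run a CEQ against HT_ID, store only new Chunks) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes fixed-size chunking for brevity and SHA-1 as the ChunkID hash, and the names `DedupStore`, `write`, `read` and `fixed_size_chunks` are hypothetical.

```python
import hashlib


def fixed_size_chunks(data: bytes, size: int):
    """Fixed Size Chunking (FSC): boundaries sit at absolute offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Toy chunk store; the dict stands in for HT_ID plus chunk storage."""

    def __init__(self):
        self.ht_id = {}  # ChunkID (SHA-1 hex digest, an assumption) -> chunk bytes

    def write(self, data: bytes, size: int = 4):
        """Store a DO; return its list of ChunkIDs and the bytes newly stored."""
        recipe, new_bytes = [], 0
        for chunk in fixed_size_chunks(data, size):
            chunk_id = hashlib.sha1(chunk).hexdigest()
            if chunk_id not in self.ht_id:      # chunk existence query (CEQ)
                self.ht_id[chunk_id] = chunk    # only new Chunks are stored
                new_bytes += len(chunk)
            recipe.append(chunk_id)
        return recipe, new_bytes

    def read(self, recipe):
        """Rebuild a DO from its list of ChunkIDs."""
        return b"".join(self.ht_id[chunk_id] for chunk_id in recipe)


store = DedupStore()
_, first = store.write(b"AAAABBBBCCCC")    # all three chunks are new
_, second = store.write(b"AAAABBBBDDDD")   # only the last chunk is new
print(first, second)
```

The second write stores a single chunk because the first two ChunkIDs already exist in the table, which is the source of the storage and transfer savings described above.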

FSC determines Chunk boundaries by absolute offset from the head of the DO. Its advantages are that it is very fast and that Chunk sizes are uniform, which is convenient for storage devices. Its fatal weakness is that it is very sensitive to insertions and deletions: every Chunk after the point of modification is affected.

Original CDC (OriginalCDC) moves a window (usually 12 Bytes to 48 Bytes in size) through the DO byte by byte, looking for a recurring pattern to use as a Chunk boundary; for example, the Rabin fingerprint of the window modulo a preset value D equals -1 (that is, D - 1). The expected Chunk size is determined by the value of D, and such a pattern is called a Marker. Because Chunk boundaries are determined by the relative positions of Markers, CDC is insensitive to insertions and deletions: usually only the Chunk at the point of modification is affected, so CDC can achieve a CR far higher than FSC. The main weakness of OriginalCDC is that Chunk sizes fluctuate widely.

BaseCDC introduces a lower bound C_min and an upper bound C_max on Chunk size, which reduces the fluctuation, but introducing C_max produces hard chunks (Chunk_H). The boundary of a Chunk_H is set by absolute offset, so it has the same weakness as FSC, and Chunk_H should be avoided as far as possible. TTTD (Two Threshold Two Divisor), in addition to setting C_min, C_max and the main divisor D, presets a backup divisor D' with D' < D, so there is a greater chance of finding a Marker that satisfies the D' boundary condition (denoted D'-Marker; a Marker satisfying the D boundary condition is correspondingly denoted D-Marker). If the current Chunk reaches size C_max without a D-Marker having been encountered, and a D'-Marker exists within that range, the D'-Marker is used as the Chunk boundary. TTTD thus reduces the number of Chunk_H while also reducing the fluctuation in Chunk size.

After data is inserted or deleted at some position in a DO, the offsets of the Chunks after that position change relative to the head of the DO. SWC slides a window (Sliding Window, SW) of size K Bytes byte by byte to detect these offset changes and determines Chunk boundaries on that basis. SWC can achieve a higher CR than CDC, and the great majority of the resulting Chunks are exactly K Bytes in size, so the fluctuation is very small.

All of the above methods share a common problem, the contradiction between CR and metadata overhead: the smaller the expected Chunk size, the higher the CR that can be obtained, but the greater the total number of Chunks, which significantly increases the metadata overhead of Chunk indexing and management.
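A TTTD-style chunker along the lines described above can be sketched as follows. Several things here are assumptions rather than details from the source: a simple polynomial rolling hash stands in for the Rabin fingerprint, the default thresholds and divisors are arbitrary small values chosen for illustration, and the names `window_hashes` and `tttd_chunks` are hypothetical.

```python
def window_hashes(data: bytes, window: int):
    """Polynomial rolling hash of the last `window` bytes ending at each offset.

    A stand-in for the Rabin fingerprint (assumption, chosen for brevity).
    """
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    h, hashes = 0, []
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        hashes.append(h)
    return hashes


def tttd_chunks(data: bytes, c_min=32, c_max=128, d=48, d_backup=24, window=16):
    """Split data at D-Markers, falling back to D'-Markers, then hard cuts."""
    hashes = window_hashes(data, window)
    chunks, start, backup, i = [], 0, -1, 0
    while i < len(data):
        size = i - start + 1
        cut = -1
        if size >= c_min:                       # no boundary below C_min
            if hashes[i] % d == d - 1:
                cut = i                         # D-Marker
            else:
                if hashes[i] % d_backup == d_backup - 1:
                    backup = i                  # remember the latest D'-Marker
                if size >= c_max:
                    # Use the D'-Marker if one was seen, else a hard chunk.
                    cut = backup if backup >= 0 else i
        if cut >= 0:
            chunks.append(data[start:cut + 1])
            start, backup, i = cut + 1, -1, cut + 1   # rescan after the cut
        else:
            i += 1
    if start < len(data):
        chunks.append(data[start:])             # final, possibly short, chunk
    return chunks
```

Every chunk except possibly the last has a size between `c_min` and `c_max`, and a hard cut at exactly `c_max` happens only when neither kind of Marker was found, which is the situation TTTD is designed to make rare.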

In general, chunk-based duplicate data elimination has two major parts: dividing a DO into non-overlapping Chunks (this process is called Chunking), and using CEQ to detect whether each Chunk in the DO is a duplicate. For the stateless chunking methods FSC and CDC, Chunking and CEQ are unrelated and independent of each other. The Chunking result depends only on the DO itself and not on the current state of the system (i.e. which Chunks are already stored in HT_ID), so no CEQ is needed during Chunking, and Chunking the same DO at any time always yields the same result. For the stateful chunking method SWC, Chunking and CEQ are closely coupled: a large number of CEQs are required during Chunking, and the result depends jointly on the DO itself and the current state of the system, so the same DO may be chunked differently under different system states.

HT_ID is usually very large and cannot be loaded entirely into memory, so a CEQ may involve disk I/O and is therefore costly. In networked environments HT_ID is usually kept on a remote metadata server, which aggravates the problem. During Chunking, SWC performs a CEQ at every position of the SW, and whenever the CEQ returns False the SW moves by one byte in order to determine the offset of the Chunk boundary. This effectively improves CR but greatly increases the number of CEQs, so the time overhead of SWC is very large. The main problem of stateful chunking methods is thus the contradiction between CR and the number of CEQs.
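The cost of stateful chunking can be made concrete with a simplified SWC sketch that counts its CEQs. This is an illustration under stated assumptions, not a reference implementation of SWC: HT_ID is modelled as an in-memory set of SHA-1 digests, the handling of leftover bytes before a match and at the tail is this sketch's own choice, and `swc_write` is a hypothetical name.

```python
import hashlib


def swc_write(data: bytes, k: int, ht_id: set):
    """Chunk `data` with a K-byte sliding window; return (chunks, CEQ count).

    Each chunk is reported as (bytes, is_duplicate). `ht_id` is updated in place.
    """
    out, ceq_count = [], 0

    def ceq(chunk):
        nonlocal ceq_count
        ceq_count += 1
        return hashlib.sha1(chunk).hexdigest() in ht_id

    def emit(chunk, dup):
        if not dup:
            ht_id.add(hashlib.sha1(chunk).hexdigest())
        out.append((chunk, dup))

    start = pos = 0                      # first unemitted byte, window offset
    while pos + k <= len(data):
        window = data[pos:pos + k]
        if ceq(window):                  # window matches a stored Chunk
            if pos > start:              # bytes slid past form a short chunk
                gap = data[start:pos]
                emit(gap, ceq(gap))
            emit(window, True)
            pos += k
            start = pos
        else:
            pos += 1                     # CEQ returned False: move one byte
            if pos - start == k:         # K unmatched bytes: emit a new chunk
                emit(data[start:pos], False)
                start = pos
    if start < len(data):
        tail = data[start:]
        emit(tail, ceq(tail))
    return out, ceq_count


ht = set()
_, first_ceqs = swc_write(b"ABCDEFGHIJKL", 4, ht)
chunks, second_ceqs = swc_write(b"ABCDxEFGHIJKL", 4, ht)
print(first_ceqs, second_ceqs, chunks)
```

In this toy run the first write of 12 bytes issues 10 CEQs to produce 3 chunks, while the second write realigns after the inserted byte and stores only that byte as new data. Both effects match the trade-off described above: better CR at the price of many more CEQs.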

Summary of the invention

Purpose of the invention: to resolve the contradiction between CR and metadata overhead and the contradiction between CR and the number of CEQs that exist in current methods, the present invention proposes a duplicate data elimination method based on pre-chunking and a sliding window (hereinafter CDSWC).

Technical solution: to achieve the above purpose, the duplicate data elimination method based on pre-chunking and a sliding window of the present invention comprises the following steps:

(1) pre-chunk the data object DO, dividing it into small chunks MC that do not overlap one another;

(2) then, taking the small chunk MC as the unit, use the sliding window method to detect consecutive new small chunks MC and merge them into a large chunk SC, while retaining small chunks MC at the junction of new and old data.

In step (1), the data object DO is chunked using the content-defined variable-size chunking method CDC, which is based on the content of the data object.

The sliding window method is as follows:

(a) Set the sliding window SSW: the sliding window SSW consists of X small chunks MC and starts sliding from the head of the data object DO.

(b) Compare the number L of remaining unprocessed small chunks RMC in the data object DO with X.

If L = X, compute the SHA-1 hash of the data in the sliding window SSW and perform a chunk existence query CEQ on the sliding window SSW. If the CEQ result is true, keep the boundaries of the unprocessed small chunks RMC, output them as X small chunks MC, and store the new small chunks MC.

If L < X and the chunk existence query CEQ on the sliding window SSW returns true, output the unprocessed small chunks RMC as R small chunks MC, perform a chunk existence query CEQ on each small chunk MC before storing it, and merge the X small chunks MC that make up the sliding window SSW into one duplicate large chunk SC for output; then advance the sliding window SSW by a distance of X small chunks MC and evaluate again.

If the chunk existence query CEQ on the sliding window SSW returns false, advance the sliding window SSW by a distance of one small chunk MC and then compare L and X again.

(c) In all cases other than those described in (b), the data in the sliding window SSW keeps its original boundaries and is output as several small chunks MC.

When merging small chunks MC, the duplicate data elimination method based on pre-chunking and a sliding window of the present invention applies the following two criteria: ⅰ. contiguous data that frequently appears together is divided into large chunks SC; ⅱ. small chunks MC are used at the boundary between new data and old data. In many application environments, successive versions of a data object DO share an important characteristic: relative to the size of the whole DO, the great majority of changes to the DO tend to be concentrated in relatively small regions. For example, in many file systems most files rarely change, and the files that change often make up only a small fraction of the whole file set, so across successive backup images of a file system the changed data is usually concentrated in small regions of the image. Consequently, much of the contiguous data lying outside the changed regions is long and also tends to recur in later versions of the DO. Because this long contiguous data is not in a changed region, dividing it into large chunks SC for storage does not incur excessive boundary overhead. Boundary overhead here means the difference between the size of the new chunks and the size of the actual new data, caused by the mismatch between chunk boundaries and the boundaries of the actual new data. In general, the smaller the chunks containing new data, the smaller the boundary overhead. Using small chunks MC at the boundary between new and old data likewise reduces boundary overhead and increases the compression ratio.
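The boundary overhead defined above can be illustrated with a small calculation. The sketch makes a simplifying assumption that is not in the source: chunks are fixed-size and aligned to multiples of the chunk size, so every chunk touched by a change counts as new. The function name `boundary_overhead` and the example figures are hypothetical.

```python
def boundary_overhead(chunk_size: int, change_start: int, change_len: int) -> int:
    """Bytes stored as 'new' beyond the bytes that actually changed.

    Assumes aligned fixed-size chunks (illustrative simplification).
    """
    first = change_start // chunk_size                     # first chunk touched
    last = (change_start + change_len - 1) // chunk_size   # last chunk touched
    new_chunk_bytes = (last - first + 1) * chunk_size
    return new_chunk_bytes - change_len


# A 100-byte change at offset 5000, stored with large versus small chunks.
print(boundary_overhead(4096, 5000, 100))
print(boundary_overhead(512, 5000, 100))
```

With 4096-byte chunks the 100 changed bytes cause 3996 bytes of overhead, while with 512-byte chunks the overhead falls to 412 bytes, which is why small chunks are preferred where new and old data meet.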

Beneficial effects: compared with the prior art, the present invention has the following advantages:

Because the present invention applies different chunking strategies to regions where data changes and regions where it does not, a good compression ratio can still be obtained with a large expected chunk size; and because the sliding window SSW slides in units of small chunks MC, the number of CEQs is reduced significantly, which reduces time overhead.

Description of drawings

Fig. 1 is a schematic diagram of the use of small chunks at the junction of new and old data in the present invention.

Fig. 2 is a schematic diagram of an example of the present invention.

Embodiment

The present invention is further illustrated below with reference to a specific embodiment. It should be understood that the embodiment serves only to illustrate the present invention and not to limit its scope; after reading the present invention, modifications by those skilled in the art to its various equivalent forms all fall within the scope defined by the claims appended to this application.

The main process of the CDSWC method that the present invention proposes is as follows.

ⅰ. Pre-chunk (Pre-Chunking) the data object DO using a content-defined variable-size chunking method CDC (such as TTTD), dividing the DO into small chunks MC that do not overlap one another, and record the boundary of each small chunk MC.

ⅱ. Initialize the flag and set the sliding window SSW, which consists of X small chunks MC; start sliding the SSW from the head of the DO.

ⅲ. Determine whether the number L of remaining unprocessed small chunks MC in the data object DO is less than X; if so, go to step ⅷ.

ⅳ. Determine whether the number R of unprocessed small chunks that the sliding window SSW has slid past (Residue Mini Chunk, RMC) equals X. If so, compute the SHA-1 hash of the data in the SSW and perform a chunk existence query CEQ(SSW) on the sliding window: if CEQ(SSW) = True, go to step ⅴ; if CEQ(SSW) = False, go to step ⅵ. If R is less than X, go to step ⅶ.

ⅴ. Keep the original boundaries of the unprocessed small chunks RMC and output them as X MCs, performing a chunk existence query CEQ on each small chunk MC before storing it (only new small chunks MC are stored). Merge the X small chunks MC that make up the sliding window SSW into one duplicate large chunk SC and output it. Set the IsPreDupSC flag (which indicates whether the Chunk preceding the RMC is a duplicate SC) to True, advance the sliding window SSW by a distance of X small chunks MC, and go to step ⅲ.

ⅵ. Check the IsPreDupSC flag: if IsPreDupSC = True, output the unprocessed small chunks RMC as X small chunks MC, performing a chunk existence query CEQ on each small chunk MC before storing it; if IsPreDupSC = False, merge the small chunks RMC into one new large chunk SC, output it and store it. Then set the IsPreDupSC flag to False, advance the sliding window SSW by a distance of one small chunk MC, and go to step ⅲ.

ⅶ. Perform a chunk existence query CEQ(SSW) on the sliding window SSW. If CEQ(SSW) = True, set the IsPreDupSC flag to True, output the RMC as R small chunks MC (when R > 0), performing a chunk existence query CEQ on each small chunk MC before storing it, and merge the X small chunks MC that make up the sliding window SSW into one duplicate large chunk SC for output; then advance the sliding window SSW by a distance of X small chunks MC and go to step ⅲ. If CEQ(SSW) = False, advance the sliding window SSW by a distance of one small chunk MC and go to step ⅲ.

ⅷ. If L > 0, perform a chunk existence query CEQ(LMC) on the L small chunks MC at the end of the data object DO (Last Mini Chunk, LMC) and check the IsPreDupSC flag: only when CEQ(LMC) = False and IsPreDupSC = True is the LMC output as L small chunks MC, each stored after a chunk existence query CEQ(MC); otherwise the LMC is merged into one large chunk SC and output, and that SC is stored if it is new. Processing then ends.

Fig. 2 gives an example of the CDSWC method; the MC boundaries obtained after Pre-Chunking are shown as dashed lines in the figure. Set X = 3 and start sliding the SSW from the head of the DO. When the SSW is at position A, CEQ(SSW_A) = True, so MCa, MCb and MCc are merged into SC1 and output (SC1 is a duplicate SC), IsPreDupSC is set to True, and the SSW is advanced by a distance of 3 MCs to position B. At position B, CEQ(SSW_B) = False, so the SSW is advanced by a distance of 1 MC; at positions C and D, CEQ(SSW) is also False.

When the SSW slides to position E, R = 3, CEQ(SSW_E) = False and IsPreDupSC = True. That is, the data in the current RMC (made up of MCd, MCe and MCf), taken as a whole, would be a new SC (the earlier CEQ(SSW_B) = False guarantees this), and the Chunk preceding it is a duplicate SC, which shows that this RMC is the junction between new and old data. It is therefore output as MC2, MC3 and MC4; these 3 MCs are stored after CEQ(MC) (the figure assumes all 3 are new), IsPreDupSC is set to False, and the SSW is advanced by a distance of one MC and processing continues.

When the SSW slides to position H, R = 3, CEQ(SSW_H) = False and IsPreDupSC = False, so the current RMC (made up of MCg, MCh and MCi) is merged into SC5 and output (SC5 is a new SC), IsPreDupSC is set to False, and the SSW is advanced by a distance of one MC and processing continues. In the same way, MCj, MCk and MCl are merged into SC6 (SC6 is a new SC).

When the SSW slides to position M, R = 2 and CEQ(SSW_M) = True, so the current RMC (made up of MCm and MCn) is output as MC7 and MC8, and MCo, MCp and MCq are merged into SC9 and output (SC9 is a duplicate SC); IsPreDupSC is then set to True and the SSW is advanced by a distance of 3 MCs to position N. At position N, L = 2 < 3 and CEQ(LMC) = True, so MCr and MCs are merged into SC10 and output (SC10 is a duplicate SC), and processing ends.
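Steps ⅲ to ⅷ and the Fig. 2 walk-through can be sketched in code as follows. This is a hedged illustration rather than the patented implementation. It takes the pre-chunked MCs as input (step ⅰ is assumed done), models HT_ID as an in-memory set, and uses the hypothetical names `Store` and `cdswc`. One point is an assumption because the text does not cover it: MCs that the window has slid past but not yet output when the tail is reached are folded into the tail. The replay at the end assumes one-byte MCs "a" to "s" and pre-stored SCs "abc", "opq" and "rs".

```python
import hashlib


class Store:
    """In-memory stand-in for HT_ID that also counts CEQs."""

    def __init__(self):
        self.ht_id, self.ceq_count = set(), 0

    def ceq(self, data: bytes) -> bool:
        self.ceq_count += 1
        return hashlib.sha1(data).hexdigest() in self.ht_id

    def save(self, data: bytes) -> None:
        self.ht_id.add(hashlib.sha1(data).hexdigest())


def cdswc(mcs, x, store):
    """Return a list of (kind, data, is_duplicate), kind being 'MC' or 'SC'."""
    out = []

    def emit_mcs(chunks):                 # keep MC boundaries, CEQ each MC
        for mc in chunks:
            dup = store.ceq(mc)
            if not dup:
                store.save(mc)
            out.append(("MC", mc, dup))

    def emit_sc(chunks, dup):             # merge MCs into one SC
        data = b"".join(chunks)
        if not dup:
            store.save(data)
        out.append(("SC", data, dup))

    pos = rmc = 0                         # SSW start; first pending MC (R = pos - rmc)
    is_pre_dup_sc = False
    while len(mcs) - pos >= x:            # step iii: L >= X
        window = mcs[pos:pos + x]
        if store.ceq(b"".join(window)):   # steps v and vii, CEQ(SSW) = True
            emit_mcs(mcs[rmc:pos])
            emit_sc(window, True)
            is_pre_dup_sc = True
            pos += x
            rmc = pos
        else:
            if pos - rmc == x:            # step vi: R = X and CEQ(SSW) = False
                if is_pre_dup_sc:
                    emit_mcs(mcs[rmc:pos])
                else:
                    emit_sc(mcs[rmc:pos], False)
                is_pre_dup_sc = False
                rmc = pos
            pos += 1                      # slide by one MC
    tail = mcs[rmc:]                      # step viii (pending MCs folded in: assumption)
    if tail:
        dup = store.ceq(b"".join(tail))
        if not dup and is_pre_dup_sc:
            emit_mcs(tail)
        else:
            emit_sc(tail, dup)
    return out


# Replay of the Fig. 2 scenario under the assumed contents described above.
store = Store()
for old_sc in (b"abc", b"opq", b"rs"):
    store.save(old_sc)
result = cdswc([bytes([c]) for c in b"abcdefghijklmnopqrs"], 3, store)
print([(kind, data.decode(), dup) for kind, data, dup in result])
```

Under these assumptions the replay produces ten chunks in the same order as SC1, MC2 to MC4, SC5, SC6, MC7, MC8, SC9 and SC10 in the example, and the window moves in steps of whole MCs, so far fewer CEQs are issued than a byte-by-byte sliding window would need.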

Claims

1. A duplicate data elimination method based on pre-chunking and a sliding window, characterized in that the specific steps of the method are as follows:

2. The duplicate data elimination method based on pre-chunking and a sliding window according to claim 1, characterized in that in step (1) the data object DO is chunked using the content-defined variable-size chunking method CDC, which is based on the content of the data object.

3. The duplicate data elimination method based on pre-chunking and a sliding window according to claim 1, characterized in that the sliding window method is as follows:

(a) set the sliding window SSW: the sliding window SSW consists of X small chunks MC and starts sliding from the head of the data object DO,

if the chunk existence query CEQ on the sliding window SSW returns false, advance the sliding window SSW by a distance of one small chunk MC and then compare L and X again;