CN105515586A - Rapid delta compression method

Info

Publication number: CN105515586A
Application number: CN201510927001.5A
Authority: CN
Inventors: 夏文; 冯丹; 李春光; 江泓
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2015-12-14
Filing date: 2015-12-14
Publication date: 2016-04-20
Anticipated expiration: 2035-12-14
Also published as: CN105515586B

Abstract

The invention discloses a rapid delta compression method comprising the following steps: performing content-based rapid segmentation on a reference block B used in delta compression to obtain a plurality of words that form a word library; performing content-based rapid segmentation on a data block A similar to the reference block B, and extending (amplifying) the repeated words detected during segmentation, to obtain repeated words and non-repeated words; encoding and storing the repeated words and non-repeated words in segmentation order, using two different data formats to record repeated words and non-repeated words respectively, to obtain a delta block Δ_{B,A}; and, when the delta block Δ_{B,A} needs to be decoded, reading the records of the two data formats from Δ_{B,A} in sequence to obtain all words of data block A in order, and writing the words to an output stream in order to restore the complete data block A. The method offers efficient repeated-word search, low computing cost, and high data compression efficiency.

Description

A fast delta compression method

Technical field

The invention belongs to the field of data compression in computer storage and, more specifically, relates to a fast delta compression method.

Background technology

In recent years, with the development of computer technology and the spread of networks, the volume of data stored worldwide has grown explosively. Although the price of storage devices keeps falling, it falls far more slowly than data volume grows. Data deduplication, a technique that effectively eliminates redundant data at large scale, has become a focus of storage-system research in recent years. Deduplication not only saves a great deal of storage space, improving the utilization of storage resources, but also improves the transfer efficiency of network bandwidth by avoiding the transmission of redundant data.

However, as deduplication has developed, it has also run into many challenges. Traditional deduplication decides whether data is duplicated by comparing the fingerprints of data blocks, so it can identify only blocks that are completely identical and cannot identify blocks that are merely very similar. For example, if two data blocks A1 and A2 differ in only a few bytes, they are almost identical, yet deduplication produces entirely different fingerprints for them and therefore leaves the redundancy between these similar blocks unprocessed. Delta compression was proposed and applied for this situation. It is an efficient data compression technique that strongly compresses a data block A_i against a similar reference data block A_r (also called the referenced block). The more similar the blocks, the higher the compression efficiency. A_r and A_i are input to the delta encoder, which outputs a delta (denoted Δ_{r,i}) that represents the compressed version of A_i. To decompress A_i, the delta Δ_{r,i} and the reference data block A_r are read, and A_i is computed from them.

Compared with data deduplication, therefore, delta compression can also eliminate the redundancy among similar but non-identical data, and thus achieves a higher data compression ratio.

However, existing delta compression techniques have the following problems: compression encoding is slow, indexing overhead is large, data compression efficiency is low, and scalability is poor. Take Xdelta, the classic and currently widely adopted delta compression algorithm from the University of California, Berkeley: its encoding speed is only about 30-60 MB/s (on a quad-core Intel Xeon 2.6 GHz processor), and such a slow encoding speed severely limits the adoption and development of the algorithm.
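The limitation of fingerprint-based deduplication described above can be illustrated with a minimal sketch. The block contents are made up for the illustration, and SHA-1 is only an assumed fingerprint function; the text does not say which fingerprint traditional deduplication uses.

```python
import hashlib

# Two 4 KiB blocks that differ in exactly one byte (illustrative data).
a1 = bytes(range(256)) * 16
a2 = bytearray(a1)
a2[100] ^= 0xFF
a2 = bytes(a2)

# SHA-1 is an assumption standing in for "the fingerprint of a data block".
fp1 = hashlib.sha1(a1).hexdigest()
fp2 = hashlib.sha1(a2).hexdigest()

differing_bytes = sum(x != y for x, y in zip(a1, a2))
print(differing_bytes)  # how many bytes differ between the two blocks
print(fp1 == fp2)       # whether deduplication would treat them as duplicates
```

Because the fingerprints do not match, a fingerprint-only scheme stores both blocks in full, which is the redundancy that delta compression targets.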

Summary of the invention

In view of the above defects or improvement needs of the prior art, the invention provides a fast delta compression method. Its aim is to identify the data that differ between similar data blocks (or files) through content-based fast word segmentation, word hashing, and index lookup of repeated words, and then to delta-encode and store the result, thereby saving storage space and solving the technical problems of existing delta compression techniques: slow compression encoding, large indexing overhead, low data compression efficiency, and poor scalability.

To achieve the above object, according to one aspect of the invention, a fast delta compression method is provided, comprising the following steps:

(1) performing content-based fast segmentation on the reference block B used in delta compression to obtain a plurality of words, which form a word library;

(2) performing content-based fast segmentation on a data block A similar to reference block B, and extending the repeated words detected during segmentation, to obtain repeated words and non-repeated words;

(3) encoding and storing the repeated words and non-repeated words obtained in step (2) in segmentation order, using two different data formats to record repeated words and non-repeated words respectively, to obtain the delta block Δ_{B,A};

(4) when the delta block Δ_{B,A} needs to be decoded, reading the records of the two data formats from Δ_{B,A} in sequence to obtain all words of data block A in order, and writing these words to the output stream in order to restore the complete data block A.

Preferably, step (1) comprises the following sub-steps:

(1-1) initialize the fast rolling hash value f = 1 and set the current sliding position of reference block B to i = 1;

(1-2) compute the fast rolling hash value f of reference block B at the current sliding position i as f = (f << 1) + B_i, where B_i denotes the byte of reference block B at the current sliding position i;

(1-3) determine whether the lowest p bits of the fast rolling hash value f computed in step (1-2) are all 0; if so, go to step (1-4); otherwise, go to step (1-5);

(1-4) mark position i as the end of a word, compute the fingerprint of this word with a fingerprint algorithm to build the word's fingerprint index entry, set f = 1, and go to step (1-5);

(1-5) set i = i + 1 and repeat steps (1-2) and (1-3) until the last byte of reference block B has been processed.
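Sub-steps (1-1) to (1-5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the hash is kept in a 64-bit register (`MASK64`), positions are 0-based rather than 1-based, trailing bytes after the last cut point are treated as a final word, and BLAKE2b from the standard library stands in for the fingerprint algorithm (the embodiment names xxHash). The names `split_words`, `build_word_library`, and `fingerprint` are hypothetical.

```python
import hashlib

MASK64 = (1 << 64) - 1  # assumption: the hash lives in a 64-bit register


def fingerprint(word: bytes) -> bytes:
    # Stand-in fingerprint (assumption); the embodiment uses xxHash instead.
    return hashlib.blake2b(word, digest_size=8).digest()


def split_words(block: bytes, p: int = 6):
    """Return the (start, end) byte ranges of the words cut from `block`."""
    words, start, f = [], 0, 1                 # step (1-1): f = 1
    low_bits = (1 << p) - 1
    for i, byte in enumerate(block):           # step (1-5): advance one byte
        f = ((f << 1) + byte) & MASK64         # step (1-2): one shift, one add
        if (f & low_bits) == 0:                # step (1-3): lowest p bits all 0
            words.append((start, i + 1))       # step (1-4): word ends here
            start, f = i + 1, 1                # reset the hash for the next word
    if start < len(block):
        words.append((start, len(block)))      # assumption: keep trailing bytes
    return words


def build_word_library(block: bytes, p: int = 6) -> dict:
    """Map each word fingerprint to the (start, end) of its first occurrence."""
    library = {}
    for start, end in split_words(block, p):
        library.setdefault(fingerprint(block[start:end]), (start, end))
    return library


reference = bytes(range(256)) * 4
print(split_words(reference, p=4)[:3])         # first few word ranges
print(len(build_word_library(reference, p=4))) # number of distinct words
```

Keeping only the first occurrence of each fingerprint is a design choice of this sketch; the steps above do not say how duplicate words inside the reference block are handled.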

Preferably, step (2) comprises the following sub-steps:

(2-1) initialize the fast rolling hash value g = 1 and set the current sliding position of data block A to j = 1;

(2-2) compute the fast rolling hash value g of data block A at the current sliding position j as g = (g << 1) + A_j, where A_j denotes the byte of data block A at the current sliding position j;

(2-3) determine whether the lowest p bits of the fast rolling hash value g computed in step (2-2) are all 0; if so, go to step (2-4); otherwise, go to step (2-7);

(2-4) mark position j as the end of a word W, and compute the fingerprint h of word W with the same fingerprint algorithm as in step (1-4);

(2-5) look up the word library of reference block B by fingerprint h, using the fingerprint index to detect whether W is a repeated word; if so, go to step (2-6); otherwise, mark word W as a non-repeated word, set g = 1, and go to step (2-7);

(2-6) mark word W as a repeated word, and continue comparing the bytes that follow W in data block A with the bytes that follow its duplicate V in reference block B, stopping at the first byte that differs; this yields the number k of identical following bytes; increase the length of word W by k, set j = j + k and g = 1, and return to step (2-2);

(2-7) set j = j + 1 and repeat steps (2-2) and (2-3) until the last byte of data block A has been processed.
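The scan of data block A in sub-steps (2-1) to (2-7) might look like the sketch below. It is an illustration only: the 64-bit register, the BLAKE2b stand-in fingerprint, the 0-based positions, the `('copy', ...)` / `('add', ...)` tuples, and the function names are all assumptions, and trailing bytes of A that never reach a cut point are emitted as a non-repeated word.

```python
import hashlib

MASK64 = (1 << 64) - 1  # assumption: 64-bit hash register


def fingerprint(word: bytes) -> bytes:
    # Stand-in for the fingerprint algorithm (the embodiment uses xxHash).
    return hashlib.blake2b(word, digest_size=8).digest()


def index_reference(ref: bytes, p: int) -> dict:
    """Word library of B: fingerprint -> start offset of the word's first occurrence."""
    index, start, f = {}, 0, 1
    for i, byte in enumerate(ref):
        f = ((f << 1) + byte) & MASK64
        if (f & ((1 << p) - 1)) == 0:
            index.setdefault(fingerprint(ref[start:i + 1]), start)
            start, f = i + 1, 1
    return index


def scan_target(ref: bytes, tgt: bytes, p: int = 4):
    """Return ('copy', ref_offset, length) and ('add', bytes) items covering `tgt`."""
    index = index_reference(ref, p)
    ops, start, g, j = [], 0, 1, 0                  # step (2-1)
    low_bits = (1 << p) - 1
    while j < len(tgt):
        g = ((g << 1) + tgt[j]) & MASK64            # step (2-2)
        j += 1                                      # j now points one past the byte
        if (g & low_bits) != 0:                     # step (2-3): no cut point yet
            continue
        word = tgt[start:j]                         # step (2-4)
        pos = index.get(fingerprint(word))          # step (2-5)
        if pos is not None and ref[pos:pos + len(word)] == word:
            r, k = pos + len(word), 0               # step (2-6): extend the match
            while j + k < len(tgt) and r + k < len(ref) and tgt[j + k] == ref[r + k]:
                k += 1
            ops.append(("copy", pos, len(word) + k))
            j += k
        else:
            ops.append(("add", word))               # non-repeated word
        start, g = j, 1
    if start < len(tgt):
        ops.append(("add", tgt[start:]))            # assumption: trailing bytes
    return ops


ref = bytes(range(256)) * 12
tgt = ref[:1000] + b"EDIT" + ref[1200:]
ops = scan_target(ref, tgt, p=4)
print(ops[0])  # ('copy', 0, 1000): a 15-byte word extended by 985 matching bytes
```

In this example the first word of A is only 15 bytes long, but extension carries the match up to the first edited byte, so the whole unchanged prefix becomes a single copy item without any further hashing or lookups.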

Preferably, step (3) is specifically: the '0' data format records the position and length of a repeated word in reference block B, and the '1' data format records the length and byte content of a non-repeated word, to obtain the delta block Δ_{B,A} of data block A relative to reference block B.

Preferably, step (4) is specifically: for a record in the '0' data format, the repeated word is fetched from reference block B according to its position and length; for a record in the '1' data format, the non-repeated word is taken directly from the record.

In general, compared with the prior art, the above technical scheme conceived by the invention achieves the following beneficial effects:
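The two record formats of steps (3) and (4) can be sketched as a small encoder and decoder. The text fixes only what each format records; the concrete byte layout used here (a one-byte tag followed by 4-byte big-endian integers) and the names `encode_delta` and `decode_delta` are assumptions made for the illustration.

```python
import struct


def encode_delta(ops) -> bytes:
    """Serialize ('copy', pos, length) and ('add', data) items into a delta block."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            # '0' format: position and length of the repeated word in B.
            _, pos, length = op
            out += struct.pack(">BII", 0, pos, length)
        else:
            # '1' format: length and byte content of the non-repeated word.
            _, data = op
            out += struct.pack(">BI", 1, len(data)) + data
    return bytes(out)


def decode_delta(ref: bytes, delta: bytes) -> bytes:
    """Rebuild data block A from reference block B and the delta block."""
    out, i = bytearray(), 0
    while i < len(delta):
        if delta[i] == 0:
            pos, length = struct.unpack_from(">II", delta, i + 1)
            out += ref[pos:pos + length]          # fetch repeated word from B
            i += 9
        else:
            (length,) = struct.unpack_from(">I", delta, i + 1)
            out += delta[i + 5:i + 5 + length]    # take word from the record
            i += 5 + length
    return bytes(out)


ref = b"the quick brown fox jumps over the lazy dog"
ops = [("copy", 0, 10), ("add", b"red"), ("copy", 15, 28)]
delta = encode_delta(ops)
print(decode_delta(ref, delta))  # b'the quick red fox jumps over the lazy dog'
print(len(delta), len(ref))      # 26 43
```

A production format would likely use variable-length integers so that short copies cost fewer than nine bytes; the fixed-width layout is chosen here only to keep the sketch short.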

1. Through content-based fast word segmentation (steps (1-2), (1-3), (2-2) and (2-3)), each slide requires only one left shift and one addition, so words are cut quickly. The method also avoids the large number of words with overlapping content that conventional compression encoding produces, simplifies word matching during delta encoding, reduces the corresponding memory and computing overhead, and still maintains very high delta compression efficiency.

2. Through the repeated-word extension strategy, the invention directly compares the content that follows a detected repeated word to check whether it is also repeated. This avoids chunking, hashing, and lookup for the bytes that follow the repeated word, and so accelerates delta encoding. Because extension also reduces the number of records in the delta, decoding is faster as well. In addition, the method preserves the locality of repeated words, which ensures higher data compression efficiency.

3. Content-based fast word segmentation helps repeated-word extension locate repeated words quickly, and repeated-word extension in turn spares delta compression the chunking and hashing of the following content. The two complement each other and yield a very fast delta encoding speed. At the same time, repeated-word extension in effect greedily detects the additional repeated content that follows a repeated word, which ensures higher delta compression efficiency.

4. The invention merges non-repeated words during encoding: multiple consecutive non-repeated words are written as a single record, which accelerates both delta encoding and decoding.

Brief description of the drawings

Fig. 1 is a flow chart of the fast delta compression encoding method of the invention.

Fig. 2 is a schematic diagram of the fast rolling hash method used by the invention.

Fig. 3 is a schematic diagram comparing the content-based chunking used by the invention with conventional compression methods.

Fig. 4 is a schematic diagram of the word extension method used by the invention.

Fig. 5 is a schematic diagram of the encoding format used by the invention.

Embodiment

To make the objects, technical schemes, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

As shown in Figure 1, the delta compression of the invention mainly comprises three parts: content-based fast word segmentation, repeated-word extension, and delta encoding with merged non-repeated words.

The content-based fast word segmentation proposed by the invention avoids the large number of words with overlapping content produced by conventional compression encoding and simplifies word matching during delta encoding. As shown in Figure 2, traditional delta compression methods generate a large number of overlapping strings in the hope of finding more repeated words. The content-based fast word segmentation proposed by the invention generates fewer words and also avoids the position-shift problem caused by content modification: identical content is guaranteed to produce identical word cut points. The advantages of the method are that each slide needs only one addition and one left shift, memory and computing overhead are small, and very high delta compression efficiency is still maintained.
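Why identical content tends to produce identical cut points can be seen from the hash itself: every left shift moves older bytes one bit higher, so once at least p bytes have been processed since the last reset, the lowest p bits of f depend only on the last p bytes. The sketch below checks this property; the 64-bit register and the name `low_p_bits` are assumptions of the illustration.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit hash register


def low_p_bits(data: bytes, p: int) -> int:
    """Lowest p bits of the shift-and-add hash after consuming `data`, starting at f = 1."""
    f = 1
    for byte in data:
        f = ((f << 1) + byte) & MASK64   # one left shift and one addition per byte
    return f & ((1 << p) - 1)


p = 6
window = b"abcdef"                        # the last p bytes seen by the hash
h1 = low_p_bits(b"one prefix " + window, p)
h2 = low_p_bits(b"a completely different and longer prefix " + window, p)
print(h1 == h2)  # True: bytes older than the last p no longer affect the test
```

The cut-point test therefore gives the same answer wherever this content appears, provided the current word is already at least p bytes long; for shorter words the reset value f = 1 still influences the low bits.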

The repeated-word extension method proposed by the invention directly matches the content that follows a detected repeated word, which avoids chunking, fingerprint calculation, and index operations for the extended region. The method builds on the word locality of data streams. Word locality in a storage system means that when words have once appeared in the sequence A, B, C, the next time word A appears, words B and C are very likely to follow it. The invention exploits this word locality for delta encoding. As shown in Figure 4, consider the word sequences of two successive similar data blocks, B₁, B₂, B₃, B₄, B₅ and A₁, A₂, A₃, A₄, A₅, and suppose content-based fast word segmentation determines that B₂ and A₂ are duplicates. By the locality principle above, words B₃, B₄, B₅ very probably duplicate words A₃, A₄, A₅ respectively, so the time-consuming chunking, hashing, and indexing of B₃, B₄, B₅ and A₃, A₄, A₅ is unnecessary; it suffices to compare onward from B₂ and A₂ and extend the match to {B₂, B₃, B₄, B₅} and {A₂, A₃, A₄, A₅}. Moreover, repeated-word extension is not confined to the boundaries produced by content-based segmentation: the comparison may stop at a position inside a word, so more repeated content can be found. Repeated-word extension therefore only needs to compare the following content byte by byte, which simplifies and accelerates delta encoding while ensuring higher data compression efficiency.

The delta encoding with merged non-repeated words proposed by the invention, shown in Figure 5, merges consecutive non-repeated words into a single encoded word, which improves delta compression efficiency and also accelerates delta decoding. Combined, the techniques above greatly accelerate delta encoding and decoding while preserving data compression efficiency.

As shown in Figure 1, the fast delta compression method of the invention performs fast delta encoding of a known data block (or file) A against a similar reference data block B, and comprises the following specific steps:
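The point that extension may stop inside a word, rather than at a segmentation boundary, can be shown with a small byte-comparison sketch. The sample strings, the `|` separators that play the role of word boundaries, and the helper name `extend_match` are made up for the illustration.

```python
def extend_match(tgt: bytes, t: int, ref: bytes, r: int) -> int:
    """Count the identical bytes starting at tgt[t] and ref[r]."""
    k = 0
    while t + k < len(tgt) and r + k < len(ref) and tgt[t + k] == ref[r + k]:
        k += 1
    return k


# '|' marks hypothetical word boundaries; the two blocks differ in the first
# word and in one byte of the word "gamma".
ref = b"HEADER|alpha|beta|gamma|delta|"
tgt = b"header|alpha|beta|gaXma|delta|"

# Suppose fingerprint lookup found the repeated word "alpha|" at offset 7 in
# both blocks; extension starts right after it, at offset 13.
k = extend_match(tgt, 13, ref, 13)
print(k)               # 7
print(tgt[7:13 + k])   # b'alpha|beta|ga'
```

The match grows across the whole word `beta|` and two bytes into the next word, all without hashing or index lookups, and stops at the first differing byte.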
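Merging consecutive non-repeated words before encoding can be sketched as a single pass over the item list. The `('copy', position, length)` and `('add', bytes)` tuples and the name `merge_adds` are hypothetical conventions of this illustration, not a format given in the text.

```python
def merge_adds(ops):
    """Collapse runs of consecutive ('add', data) items into one item each."""
    merged = []
    for op in ops:
        if op[0] == "add" and merged and merged[-1][0] == "add":
            # Append to the previous non-repeated record instead of starting a new one.
            merged[-1] = ("add", merged[-1][1] + op[1])
        else:
            merged.append(op)
    return merged


ops = [("add", b"ab"), ("add", b"cd"), ("copy", 40, 16),
       ("add", b"e"), ("add", b"fg"), ("add", b"h")]
print(merge_adds(ops))
# [('add', b'abcd'), ('copy', 40, 16), ('add', b'efgh')]
```

Six items become three, so the encoder writes fewer record headers and the decoder processes fewer records.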

(1) Perform content-based fast segmentation (FastRolling) on the reference block B used in delta compression to obtain a plurality of words, which form a word library. This step comprises the following sub-steps:

(1-2) compute the fast rolling hash value f of reference block B at the current sliding position i as f = (f << 1) + B_i, where B_i denotes the byte of reference block B at the current sliding position i, as shown in Figure 2;

Using one left shift and one addition, this hash algorithm provides the function of a rolling hash (the current hash value is derived from the previous one: old bytes are gradually shifted out of the hash value by the left shifts, and each new byte enters the hash value through the addition) while also being fast to compute and producing strongly random hash values.

(1-3) determine whether the lowest p bits of the fast rolling hash value f computed in step (1-2) are all 0, that is, whether f & (2^p - 1) equals 0. The value of p is chosen so that the average length of the words obtained by segmenting reference block B is approximately 2^p bytes; for example, if the average word length is about 64 bytes, then p = 6. In general 4 <= p <= 13, so the average word length lies between 16 bytes and 8192 bytes. If so, go to step (1-4); otherwise, go to step (1-5);

The content-based fast word segmentation here guarantees that identical content produces identical word cut points. The effect of delta compression can be tuned through the average word length (by setting p to different values). In general, the larger the average word length, the coarser the granularity of delta compression, the faster the computation, and the lower the compression ratio; the smaller the average word length, the finer the granularity, the slower the computation, and the higher the compression ratio.

As shown in Figure 3, content-based fast word segmentation produces only a small number of words for repeated-word lookup, avoids the large number of words with overlapping content produced by conventional compression encoding, and simplifies word matching during delta encoding.

(1-4) mark position i as the end of a word, compute the fingerprint of this word with a fingerprint algorithm (xxHash in this embodiment) to build the word's fingerprint index entry, set f = 1, and go to step (1-5);

(2) Perform content-based fast segmentation on the data block A similar to reference block B, and extend the repeated words detected during segmentation, to obtain repeated words and non-repeated words. This step comprises the following sub-steps:

(2-2) compute the fast rolling hash value g of data block A at the current sliding position j as g = (g << 1) + A_j, where A_j denotes the byte of data block A at the current sliding position j;

(2-3) determine whether the lowest p bits of the fast rolling hash value g computed in step (2-2) are all 0 (that is, whether g & (2^p - 1) equals 0, with the same value of p as in step (1-3)); if so, go to step (2-4); otherwise, go to step (2-7);

(2-4) mark position j as the end of a word W, and compute the fingerprint h of word W with the same fingerprint algorithm as in step (1-4) (xxHash in this embodiment);

(2-5) look up the word library of reference block B by fingerprint h, using the fingerprint index to detect whether W is a repeated word; if so (h exists in the fingerprint index, and the corresponding word V in reference block B also matches word W byte for byte), go to step (2-6); otherwise, mark word W as a non-repeated word, set g = 1, and go to step (2-7);

(2-6) mark word W as a repeated word, and continue comparing the bytes that follow W with the bytes that follow V, stopping at the first byte that differs (as shown in Figure 4). This yields the number k of identical following bytes; increase the length of word W by k, set j = j + k and g = 1, and return to step (2-2);

Continuing to compare the bytes that follow the repeated words W and V is precisely the repeated-word extension method proposed by the invention. The extended region needs no time-consuming content-based chunking, fingerprint calculation, or fingerprint indexing, which saves time.

(3) Encode and store the repeated words and non-repeated words obtained in step (2) in segmentation order, using two different data formats to record repeated words and non-repeated words respectively, to obtain the delta block Δ_{B,A}. Specifically, as shown in Figure 5, '0' and '1' denote the two data formats: the '0' data format records the position and length of a repeated word in reference block B, and the '1' data format records the length and byte content of a non-repeated word, which yields the delta block of data block A relative to reference block B (denoted Δ_{B,A}).

(4) When the delta block Δ_{B,A} needs to be decoded, read the records of the two data formats from Δ_{B,A} in sequence to obtain all words of data block A in order, and write these words to the output stream in order to restore the complete data block A. Specifically, as shown in Figure 5, for a record in the '0' data format, the repeated word is fetched from reference block B according to its position and length; for a record in the '1' data format, the non-repeated word is taken directly from the record.

In summary, the invention accelerates delta encoding and offers efficient repeated-word search, low computing overhead, and high data compression efficiency.

Those skilled in the art will readily understand that the foregoing is only a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
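The relation between p and the average word length stated in step (1-3) can be checked empirically. The sketch below cuts pseudo-random bytes with the same shift-and-add rule and compares the measured average with 2^p; the random test data, the seed, the 64-bit register, and the name `average_word_length` are assumptions of the experiment, and real data may deviate from the random-data average.

```python
import random

MASK64 = (1 << 64) - 1  # assumption: 64-bit hash register


def average_word_length(block: bytes, p: int) -> float:
    """Average distance between cut points when the lowest p bits must be 0."""
    cuts, f = 0, 1
    low_bits = (1 << p) - 1
    for byte in block:
        f = ((f << 1) + byte) & MASK64
        if (f & low_bits) == 0:
            cuts += 1
            f = 1                      # reset, as in step (1-4)
    return len(block) / max(cuts, 1)


rng = random.Random(2015)              # fixed seed so the run is repeatable
block = bytes(rng.getrandbits(8) for _ in range(200_000))

for p in (4, 6, 8):
    # Columns: p, expected average 2^p, measured average word length.
    print(p, 2 ** p, round(average_word_length(block, p), 1))
```

On uniformly random bytes each position passes the test with probability 2^-p, which is why the measured averages land close to 16, 64, and 256 bytes.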

Claims

1. A fast delta compression method, comprising the following steps:

2. The fast delta compression method according to claim 1, characterized in that step (1) comprises the following sub-steps:

3. The fast delta compression method according to claim 1, characterized in that step (2) comprises the following sub-steps:

(2-3) determine whether the lowest p bits of the fast rolling hash value g computed in step (2-2) are all 0; if so, go to step (2-4); otherwise, go to step (2-7);

4. The fast delta compression method according to claim 1, characterized in that step (3) is specifically: the '0' data format records the position and length of a repeated word in reference block B, and the '1' data format records the length and byte content of a non-repeated word, to obtain the delta block Δ_{B,A} of data block A relative to reference block B.

5. The fast delta compression method according to claim 1, characterized in that step (4) is specifically: for a record in the '0' data format, the repeated word is fetched from reference block B according to its position and length; for a record in the '1' data format, the non-repeated word is taken directly from the record.