CN105515586B

CN105515586B - A fast delta compression method

Info

Publication number: CN105515586B
Application number: CN201510927001.5A
Authority: CN
Inventors: 夏文; 冯丹; 李春光; 江泓
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2015-12-14
Filing date: 2015-12-14
Publication date: 2019-04-12
Anticipated expiration: 2035-12-14
Also published as: CN105515586A

Abstract

The invention discloses a fast delta compression method, comprising: performing content-based fast chunking on the reference block B used in delta compression to obtain multiple words, which form a word library; performing content-based fast chunking on a data block A that is similar to reference block B, and expanding each duplicate word detected during chunking, to obtain duplicate words and non-duplicate words; encoding and storing the duplicate and non-duplicate words in chunking order, recording the two kinds of word in two different data formats, to obtain the differential data block Δ_{B,A}; and, when Δ_{B,A} needs to be decoded, reading the records of the two data formats from Δ_{B,A} in order to obtain all the words of data block A, and writing these words to the output stream in order to recover the complete data block A. The invention offers fast duplicate-word lookup, low computational overhead, and high data compression efficiency.

Description

A fast delta compression method

Technical field

The invention belongs to the field of data compression in computer storage, and more particularly relates to a fast delta compression method.

Background art

In recent years, with the development of computer technology and the spread of networks, the volume of data stored worldwide has grown explosively. Although the price of storage devices keeps falling, it cannot keep pace with the growth of data. Data deduplication, a technology that effectively eliminates redundant data at large scale, has become a hot topic in storage system research in recent years. Data deduplication not only saves storage space and thereby greatly improves the utilization of storage resources, but also improves the efficiency of network bandwidth by avoiding the transmission of redundant data.

However, as data deduplication has developed, it has also run into many challenges. Because traditional data deduplication decides whether data is duplicated based on the fingerprints of data blocks, it can only identify fully identical data blocks and cannot identify blocks that are merely very similar. For example, when two data blocks A1 and A2 differ in only a few bytes, A1 and A2 are almost identical, yet deduplication produces completely different fingerprints for them and therefore ignores the redundancy in such similar data. Delta compression was proposed for this situation. It is an efficient data compression technique that can highly compress a similar data block A_i against a base data block (or reference data block) A_r. The more similar the data blocks, the higher the compression efficiency. The relationship is as follows: A_r and A_i are input to the Delta encoder, which outputs differential data (denoted Δ_{r,i}) representing the compressed version of A_i. To decompress A_i, the differential data Δ_{r,i} and the base data block A_r are read, and A_i is computed from them.
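The encode and decode relationship described in this paragraph can be written compactly as follows. This is only a notational sketch: the operator names Encode and Decode are assumptions introduced here for illustration and do not appear in the source.

```latex
\Delta_{r,i} = \mathrm{Encode}(A_r, A_i),
\qquad
A_i = \mathrm{Decode}(A_r, \Delta_{r,i})
```

The higher the similarity between A_r and A_i, the smaller Δ_{r,i} is expected to be relative to A_i.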

Therefore, compared with data deduplication, delta compression can also eliminate redundant data that is not duplicated but is similar, and thus obtains a higher data compression ratio.

However, existing delta compression techniques have the following problems: compression encoding is slow, index overhead is high, data compression efficiency is low, and scalability is poor. Take Xdelta, the classic and currently widely used delta compression algorithm from the University of California, Berkeley, as an example: its compression encoding rate is only about 30-60 MB/s (on a quad-core Intel Xeon 2.6 GHz processor). Such a slow delta encoding rate severely limits the adoption and development of the algorithm.

Summary of the invention

In view of the above defects of, or needs for improvement in, the prior art, the present invention provides a fast delta compression method. Its purpose is to identify the differing data between similar data blocks (or files) through operations such as content-based fast word chunking, word hash computation, and index lookup of duplicate words, and finally to perform delta encoding and storage, thereby saving storage space and solving the technical problems of slow compression encoding, high index overhead, low data compression efficiency, and poor scalability in existing delta compression techniques.

To achieve the above object, according to one aspect of the present invention, a fast delta compression method is provided, comprising the following steps:

(1) performing content-based fast chunking on the reference block B used in delta compression to obtain multiple words, which form a word library;

(2) performing content-based fast chunking on a data block A similar to reference block B, and expanding each duplicate word detected during fast chunking, to obtain duplicate words and non-duplicate words;

(3) encoding and storing the duplicate words and non-duplicate words obtained in step (2) in chunking order, recording duplicate words and non-duplicate words in two different data formats respectively, to obtain the differential data block Δ_{B,A};

(4) when the differential data block Δ_{B,A} needs to be decoded, reading the records of the two data formats from Δ_{B,A} in order to obtain all the words of data block A, and writing these words to the output stream in order, to recover the complete data block A.

Preferably, step (1) includes the following sub-steps:

(1-1) initialize the fast rolling hash value f = 1, and set the current sliding position of reference block B to i = 1;

(1-2) compute the fast rolling hash value f of reference block B at the current sliding position i as f = (f << 1) + B_i, where B_i denotes the byte of reference block B at the current sliding position i;

(1-3) determine whether the lowest p bits of the fast rolling hash value f computed in step (1-2) are all 0; if so, go to step (1-4); otherwise go to step (1-5);

(1-4) mark position i as the end of a word, compute a fingerprint for the word using a fingerprint algorithm to build the fingerprint index of the word, set f = 1, and go to step (1-5);

(1-5) set i = i + 1 and repeat steps (1-2) and (1-3) until the last byte of reference block B has been processed.
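Sub-steps (1-1) to (1-5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the 32-bit hash width, the use of BLAKE2b as the fingerprint (the embodiment names xxHash), the 0-based offsets, the treatment of trailing bytes as a final word, and the names `build_word_library` and `fingerprint` are all choices made here.

```python
import hashlib

MASK32 = 0xFFFFFFFF  # assumed 32-bit hash width; the source does not state one


def fingerprint(word: bytes) -> bytes:
    """Stand-in fingerprint; the embodiment names xxHash instead."""
    return hashlib.blake2b(word, digest_size=8).digest()


def build_word_library(ref: bytes, p: int = 6) -> dict:
    """Cut `ref` into words and index them by fingerprint.

    Returns {fingerprint: (start_offset, length)} using 0-based offsets.
    """
    low_bits = (1 << p) - 1
    index = {}
    f = 1          # (1-1) initial hash value
    start = 0      # start of the word currently being scanned
    for i, byte in enumerate(ref):
        f = ((f << 1) + byte) & MASK32           # (1-2) one shift, one add
        if f & low_bits == 0:                    # (1-3) lowest p bits all zero
            word = ref[start:i + 1]              # (1-4) position i ends a word
            index.setdefault(fingerprint(word), (start, len(word)))
            start = i + 1
            f = 1
    # Assumption: bytes after the last cut point are indexed as a final word.
    if start < len(ref):
        word = ref[start:]
        index.setdefault(fingerprint(word), (start, len(word)))
    return index
```

For example, with p = 2 the input `b"\x01\x02\x04"` is cut after its second byte (f becomes 8, whose two lowest bits are 0), giving the words `b"\x01\x02"` and `b"\x04"`.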

Preferably, step (2) includes the following sub-steps:

(2-1) initialize the fast rolling hash value g = 1, and set the current sliding position of data block A to j = 1;

(2-2) compute the fast rolling hash value g of data block A at the current position j as g = (g << 1) + A_j, where A_j denotes the byte of data block A at the current sliding position j;

(2-3) determine whether the lowest p bits of the fast rolling hash value g computed in step (2-2) are all 0; if so, go to step (2-4); otherwise go to step (2-7);

(2-4) mark position j as the end of a word W, and compute the fingerprint h of word W using the same fingerprint algorithm as in step (1-4);

(2-5) look up the word library of reference block B by fingerprint h, using the fingerprint index to detect whether W is a duplicate word; if so, go to step (2-6); otherwise mark word W as a non-duplicate word, set g = 1, and go to step (2-7);

(2-6) mark word W as a duplicate word, and continue comparing the bytes that follow word W with the bytes that follow its duplicate word V in reference block B, stopping at the first differing byte; this yields the number k of identical subsequent bytes; increase the length of word W by k, set j = j + k and g = 1, and return to step (2-2);

(2-7) set j = j + 1 and repeat steps (2-2) and (2-3) until the last byte of data block A has been processed.
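Sub-steps (2-1) to (2-7) can be sketched as a single scanning loop. The sketch below is self-contained and makes several assumptions that the source does not state: a 32-bit hash width, BLAKE2b as a stand-in for the fingerprint algorithm, 0-based positions, and trailing bytes treated as a final word. The names `cut`, `fp`, and `find_words` are hypothetical.

```python
import hashlib

MASK32 = 0xFFFFFFFF  # assumed 32-bit hash width (not stated in the source)


def fp(word: bytes) -> bytes:
    """Stand-in fingerprint; the embodiment names xxHash instead."""
    return hashlib.blake2b(word, digest_size=8).digest()


def cut(data: bytes, p: int):
    """Yield (start, end) for each content-defined word, end exclusive."""
    h, start = 1, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & MASK32
        if h & ((1 << p) - 1) == 0:
            yield start, i + 1
            h, start = 1, i + 1
    if start < len(data):        # trailing bytes form a final word (assumption)
        yield start, len(data)


def find_words(a: bytes, b: bytes, p: int = 6):
    """Return [(is_duplicate, a_start, length, b_pos)] in cutting order."""
    library = {}
    for s, e in cut(b, p):
        library.setdefault(fp(b[s:e]), s)        # fingerprint -> offset in B
    low_bits = (1 << p) - 1
    words, g, start, j = [], 1, 0, 0             # (2-1), 0-based here
    while j < len(a):
        g = ((g << 1) + a[j]) & MASK32           # (2-2)
        j += 1                                   # j is one past the hashed byte
        if g & low_bits != 0 and j < len(a):     # (2-3) not a cut point
            continue
        word = a[start:j]                        # (2-4) a word ends here
        pos = library.get(fp(word))              # (2-5) fingerprint lookup
        if pos is not None and b[pos:pos + len(word)] == word:
            k, tail = 0, pos + len(word)         # (2-6) expand byte by byte
            while (j + k < len(a) and tail + k < len(b)
                   and a[j + k] == b[tail + k]):
                k += 1
            words.append((True, start, len(word) + k, pos))
            j += k
        else:
            words.append((False, start, len(word), None))
        g, start = 1, j
    return words
```

When A and B are identical, the first word of A is found in the library and the expansion runs to the end of the block, so the whole of A is described by a single duplicate entry.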

Preferably, step (3) is specifically: the '0' data format is used to record the position and length of a duplicate word in reference block B, and the '1' data format is used to record the length and byte content of a non-duplicate word, to obtain the differential data block Δ_{B,A} of data block A relative to reference block B.

Preferably, step (4) is specifically: for a record in the '0' data format, the duplicate word is fetched from reference block B according to the position and length information; for a record in the '1' data format, the non-duplicate word is taken directly from the record.
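The two record formats of steps (3) and (4) can be illustrated with in-memory tuples. This is a sketch under assumptions: the tuple shapes `(0, position, length)` and `(1, length, bytes)`, the 0-based positions, and the function names `encode_records` and `decode_records` are hypothetical, and the input word list is assumed to have the form `(is_duplicate, a_start, length, b_pos)`.

```python
def encode_records(a: bytes, words):
    """Step (3): turn a word list into '0' and '1' records, in cutting order.

    words: iterable of (is_duplicate, a_start, length, b_pos).
    """
    records = []
    for is_dup, a_start, length, b_pos in words:
        if is_dup:
            # '0' format: where the duplicate word lives in reference block B
            records.append((0, b_pos, length))
        else:
            # '1' format: the literal bytes of the non-duplicate word
            records.append((1, length, a[a_start:a_start + length]))
    return records


def decode_records(b: bytes, records) -> bytes:
    """Step (4): rebuild data block A from reference block B and the records."""
    out = bytearray()
    for rec in records:
        if rec[0] == 0:
            _, pos, length = rec
            out += b[pos:pos + length]       # copy the duplicate word from B
        else:
            _, length, payload = rec
            out += payload[:length]          # take the literal from the record
    return bytes(out)
```

A small usage example: with B = `b"hello world, hello delta"` and A = `b"hello brave world"`, the word list `[(True, 0, 6, 0), (False, 6, 6, None), (True, 12, 5, 6)]` encodes to `[(0, 0, 6), (1, 6, b"brave "), (0, 6, 5)]`, and decoding those records against B returns A.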

In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:

1. Through the content-based fast word chunking method (i.e. steps (1-2), (1-3), (2-2) and (2-3)), each slide requires only one left-shift operation and one addition, which achieves fast word chunking. The method also avoids the large number of words with overlapping content that conventional compression encoding generates, simplifies word matching during delta encoding, reduces the corresponding memory and computational overhead, and still guarantees high delta compression efficiency.

2. Through the duplicate-word expansion strategy, the present invention directly compares the content that follows a detected duplicate word to see whether it is also duplicated. This avoids chunking, hashing, and lookup for the bytes that follow the duplicate word and thereby speeds up delta encoding. Because duplicate-word expansion reduces the number of records in the differential data, it also speeds up decoding. Moreover, the method preserves the locality of duplicate words, which ensures higher data compression efficiency.

3. The content-based fast word chunking method helps the duplicate-word expansion method quickly locate duplicate words, and the duplicate-word expansion method in turn spares delta compression the chunking, hashing, and similar computation for the content that follows. The two therefore complement each other and yield a very fast delta encoding speed. At the same time, the duplicate-word expansion method greedily detects as much duplicate content as possible after a duplicate word, which ensures higher delta compression efficiency.

4. The present invention merges non-duplicate words before encoding them: multiple consecutive non-duplicate words are stored in a single record, which speeds up both delta encoding and decoding.

Brief description of the drawings

Fig. 1 is a flowchart of the fast delta compression encoding method of the present invention.

Fig. 2 is a schematic diagram of the fast rolling hash method used by the present invention.

Fig. 3 is a schematic diagram comparing the content-based chunking used by the present invention with conventional compression methods.

Fig. 4 is a schematic diagram of the word expansion method used by the present invention.

Fig. 5 is a schematic diagram of the encoding format used by the present invention.

Detailed description of the embodiments

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

As shown in Fig. 1, the delta compression of the present invention mainly comprises three parts: content-based fast word chunking, duplicate-word expansion, and delta encoding with merged non-duplicate words.

The content-based fast word chunking method proposed by the present invention avoids the large number of words with overlapping content that conventional compression encoding generates, and simplifies word matching during delta encoding. As shown in Fig. 2, traditional delta compression methods generate a large number of overlapping strings in the hope of finding more duplicate words. The content-based fast word chunking method proposed by the present invention produces fewer words and also avoids the position-shift problem caused by content modification: identical content is guaranteed to produce identical word cut points. The advantage of this method is that each slide requires only one addition and one left shift, so memory and computational overhead are small, while high delta compression efficiency is still guaranteed.

The duplicate-word expansion method proposed by the present invention directly matches the content that follows a detected duplicate word, so chunking, fingerprint computation, index lookup, and similar operations can be skipped for the expanded region. The method builds on the word locality of data streams. Word locality in a storage system means that once words have appeared in the order A, B, C, then the next time word A appears, words B and C are likely to follow immediately. The present invention exploits this locality to perform delta encoding. As shown in Fig. 4, for the word sequences of two similar data blocks, B₁, B₂, B₃, B₄, B₅ and A₁, A₂, A₃, A₄, A₅, the content-based fast word chunking method can determine that B₂ and A₂ are duplicates. By the locality principle above, words B₃, B₄, B₅ are likely to duplicate words A₃, A₄, A₅ respectively, so the time-consuming chunking, hashing, and indexing of B₃, B₄, B₅ and A₃, A₄, A₅ become unnecessary; it suffices to compare onward from B₂ and A₂ and expand the words to {B₂, B₃, B₄, B₅} and {A₂, A₃, A₄, A₅}. Moreover, duplicate-word expansion is not confined to the content-defined word boundaries: it may stop in the middle of a word, which allows more duplicate content to be found. Because this method only needs to compare subsequent content, it simplifies and speeds up delta encoding while ensuring higher data compression efficiency.

In the delta encoding with merged non-duplicate words proposed by the present invention, as shown in Fig. 5, consecutive non-duplicate words are merged and encoded as a single word, which improves delta compression efficiency and also speeds up delta decoding. Together, the techniques above significantly speed up delta encoding and decoding while preserving data compression efficiency.
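Merging consecutive non-duplicate words into one record can be sketched as a single pass over the record list. The record shapes used here, `(0, position, length)` for a duplicate word and `(1, length, bytes)` for a non-duplicate word, and the name `merge_literals` are assumptions made for illustration.

```python
def merge_literals(records):
    """Merge runs of adjacent '1' (non-duplicate) records into one record.

    records: list of (0, position, length) or (1, length, payload) tuples.
    """
    merged = []
    for rec in records:
        if rec[0] == 1 and merged and merged[-1][0] == 1:
            # Extend the previous literal record instead of starting a new one.
            _, prev_len, prev_payload = merged[-1]
            merged[-1] = (1, prev_len + rec[1], prev_payload + rec[2])
        else:
            merged.append(rec)
    return merged
```

For instance, `[(0, 0, 4), (1, 2, b"ab"), (1, 3, b"cde"), (0, 10, 5)]` becomes `[(0, 0, 4), (1, 5, b"abcde"), (0, 10, 5)]`: one length field and one record header are saved, and the decoder performs one copy instead of two.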

As shown in Fig. 1, the fast delta compression method of the present invention performs fast delta encoding of a known data block (or file) A against a reference data block B that is similar to it, and comprises the following specific steps:

(1) Perform content-based fast chunking (FastRolling) on the reference block B used in delta compression to obtain multiple words, which form a word library. This step specifically includes the following sub-steps:

(1-2) compute the fast rolling hash value f of reference block B at the current sliding position i as f = (f << 1) + B_i, where B_i denotes the byte of reference block B at the current sliding position i, as shown in Fig. 2;

This hash algorithm provides a rolling hash with only one left shift and one addition (that is, the current hash value can be obtained from the previous hash value: the content of old bytes is gradually shifted out of the hash value by the left shifts, and the content of the new byte enters the hash value through the addition). It is also fast to compute and produces hash values with strong randomness.
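The rolling property described here can be checked with a few lines of Python. The sketch assumes the hash is kept in a fixed-width register (32 bits is chosen here; the source does not state a width), so a byte added n steps ago contributes `byte << n` and vanishes once n reaches the width. The names `roll`, `rolling_hash`, and `low_bits` are hypothetical.

```python
WIDTH = 32                      # assumed register width in bits
MASK = (1 << WIDTH) - 1


def roll(h: int, byte: int) -> int:
    """One slide: a left shift and an add, truncated to WIDTH bits."""
    return ((h << 1) + byte) & MASK


def rolling_hash(data: bytes, h: int = 1) -> int:
    """Hash of `data`, starting from the initial value 1 used in step (1-1)."""
    for byte in data:
        h = roll(h, byte)
    return h


def low_bits(data: bytes, p: int) -> int:
    """The lowest p bits of the hash, which decide whether a cut point occurs."""
    return rolling_hash(data) & ((1 << p) - 1)
```

Under this assumption, two inputs that end in the same 32 bytes have the same hash regardless of what precedes them, and the lowest p bits depend only on the last p bytes. This is why identical content produces identical cut points even after earlier bytes have been modified.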

(1-3) determine whether the lowest p bits of the fast rolling hash value f computed in step (1-2) are all 0, i.e. whether f & (2^p − 1) equals 0; if so, go to step (1-4); otherwise go to step (1-5). The value of p is chosen so that the average length of the words obtained by chunking reference block B is approximately 2^p bytes; for example, if the average word length is about 64 bytes, then p = 6. In general 4 ≤ p ≤ 13, i.e. the average word length is kept between 16 bytes and 8192 bytes;

The content-based fast word chunking method used here guarantees that identical content produces identical word cut points. The effect of delta compression can be controlled by adjusting the average word length (i.e. setting p to different values). In general, the larger the average word length, the coarser the granularity of delta compression, the faster the computation, and the lower the compression ratio; the smaller the average word length, the finer the granularity, the slower the computation, and the higher the compression ratio.
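A short derivation of why the average word length is about 2^p bytes. It rests on an assumption that the source does not state explicitly: that the lowest p bits of f behave like independent, uniformly distributed random bits at each position.

```latex
\Pr[\text{cut at a given position}]
  = \Pr\!\left[f \bmod 2^{p} = 0\right] = 2^{-p},
\qquad
\mathbb{E}[\text{word length}]
  = \sum_{n \ge 1} n \, 2^{-p} \left(1 - 2^{-p}\right)^{n-1} = 2^{p}.
```

For p = 6 this gives 64 bytes, and the range 4 ≤ p ≤ 13 gives 16 to 8192 bytes, which matches the figures stated for step (1-3).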

As shown in Fig. 3, content-based fast word chunking produces a small number of words for duplicate-word lookup, avoids the large number of words with overlapping content that conventional compression encoding generates, and simplifies word matching in delta encoding.

(1-4) mark position i as the end of a word, compute a fingerprint for the word using a fingerprint algorithm (in this embodiment, the fingerprint algorithm used is xxHash) to build the fingerprint index of the word, set f = 1, and go to step (1-5);

(2) Perform content-based fast chunking on a data block A similar to reference block B, and expand each duplicate word detected during fast chunking, to obtain duplicate words and non-duplicate words. This step specifically includes the following sub-steps:

(2-2) compute the fast rolling hash value g of data block A at the current position j as g = (g << 1) + A_j, where A_j denotes the byte of data block A at the current sliding position j;

(2-3) determine whether the lowest p bits of the fast rolling hash value g computed in step (2-2) are all 0 (i.e. whether g & (2^p − 1) equals 0, where p has the same value as in step (1-3)); if so, go to step (2-4); otherwise go to step (2-7);

(2-4) mark position j as the end of a word W, and compute the fingerprint h of word W using the same fingerprint algorithm as in step (1-4) (in this embodiment, the fingerprint algorithm used is xxHash);

(2-5) look up the word library of reference block B by fingerprint h, using the fingerprint index to detect whether W is a duplicate word; if so (h exists in the fingerprint index, and the byte content of the corresponding word V in reference block B exactly matches that of word W), go to step (2-6); otherwise mark word W as a non-duplicate word, set g = 1, and go to step (2-7);

(2-6) mark word W as a duplicate word, and continue comparing the bytes that follow word W with the bytes that follow word V, stopping at the first differing byte (as shown in Fig. 4). This yields the number k of identical subsequent bytes; increase the length of word W by k, set j = j + k and g = 1, and return to step (2-2);

Continuing to compare the bytes that follow the duplicate words W and V is the duplicate-word expansion method proposed by the present invention. The expanded region needs no time-consuming content-based chunking, fingerprint computation, or fingerprint index lookup, which saves time.

(3) Encode and store the duplicate words and non-duplicate words obtained in step (2) in chunking order, recording duplicate words and non-duplicate words in two different data formats respectively, to obtain the differential data block Δ_{B,A}. Specifically, as shown in Fig. 5, the two data formats are denoted '0' and '1': the '0' data format records the position and length of a duplicate word in reference block B, and the '1' data format records the length and byte content of a non-duplicate word, to obtain the differential data block of data block A relative to reference block B (denoted Δ_{B,A}).

(4) When the differential data block Δ_{B,A} needs to be decoded, read the records of the two data formats from Δ_{B,A} in order to obtain all the words of data block A, and write these words to the output stream in order, to recover the complete data block A. Specifically, as shown in Fig. 5, for a record in the '0' data format, the duplicate word is fetched from reference block B according to the position and length information; for a record in the '1' data format, the non-duplicate word is taken directly from the record.
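One possible byte-level serialization of the '0' and '1' records, with a matching streaming decoder, is sketched below. The layout is hypothetical, since the exact format of Fig. 5 is not reproduced in this text: a 1-byte tag, then either a 32-bit position and 32-bit length ('0'), or a 32-bit length followed by the literal bytes ('1'), all little-endian. The names `pack_delta` and `unpack_and_decode` are also assumptions.

```python
import struct

DUP_FMT = "<BII"   # tag 0, position in B, length   (assumed layout)
LIT_FMT = "<BI"    # tag 1, length, then raw bytes  (assumed layout)


def pack_delta(records) -> bytes:
    """Serialize (0, position, length) and (1, length, payload) records."""
    out = bytearray()
    for rec in records:
        if rec[0] == 0:
            out += struct.pack(DUP_FMT, 0, rec[1], rec[2])
        else:
            out += struct.pack(LIT_FMT, 1, rec[1]) + rec[2]
    return bytes(out)


def unpack_and_decode(delta: bytes, b: bytes) -> bytes:
    """Read records from `delta` in order and rebuild data block A."""
    out = bytearray()
    pos = 0
    while pos < len(delta):
        tag = delta[pos]
        if tag == 0:
            _, offset, length = struct.unpack_from(DUP_FMT, delta, pos)
            pos += struct.calcsize(DUP_FMT)
            out += b[offset:offset + length]     # duplicate word from B
        elif tag == 1:
            _, length = struct.unpack_from(LIT_FMT, delta, pos)
            pos += struct.calcsize(LIT_FMT)
            out += delta[pos:pos + length]       # literal bytes from the record
            pos += length
        else:
            raise ValueError(f"unknown record tag {tag} at offset {pos}")
    return bytes(out)
```

With B = `b"hello world, hello delta"`, the records `[(0, 0, 6), (1, 6, b"brave "), (0, 6, 5)]` pack into 29 bytes and decode back to `b"hello brave world"`. Fixed 32-bit fields keep the sketch simple; a real format might use variable-length integers to save space.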

In summary, the present invention speeds up the delta encoding process and offers fast duplicate-word lookup, low computational overhead, and high data compression efficiency.

Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A fast delta compression method, comprising the following steps:

(1) performing content-based fast chunking on the reference block B used in delta compression to obtain multiple words, which form a word library;

(2) performing content-based fast chunking on a data block A similar to reference block B, and expanding each duplicate word detected during fast chunking, to obtain duplicate words and non-duplicate words;

wherein step (2) includes the following sub-steps:

(2-2) computing the fast rolling hash value g of data block A at the current position j as g = (g << 1) + A_j, where A_j denotes the byte of data block A at the current sliding position j;

(2-3) determining whether the lowest p bits of the fast rolling hash value g computed in step (2-2) are all 0; if so, going to step (2-4); otherwise going to step (2-7);

(2-4) marking position j as the end of a word W, and computing the fingerprint h of word W using a fingerprint algorithm;

(2-5) looking up the word library of reference block B by fingerprint h, using the fingerprint index to detect whether W is a duplicate word; if so, going to step (2-6); otherwise marking word W as a non-duplicate word, setting g = 1, and going to step (2-7);

(2-6) marking word W as a duplicate word, and continuing to compare the bytes that follow word W with the bytes that follow its duplicate word V in reference block B, stopping at the first differing byte, to obtain the number k of identical subsequent bytes; increasing the length of word W by k, setting j = j + k and g = 1, and returning to step (2-2);

(2-7) setting j = j + 1 and repeating steps (2-2) and (2-3) until the last byte of data block A has been processed;

(3) encoding and storing the duplicate words and non-duplicate words obtained in step (2) in chunking order, recording duplicate words and non-duplicate words in two different data formats respectively, to obtain the differential data block Δ_{B,A};

(4) when the differential data block Δ_{B,A} needs to be decoded, reading the records of the two data formats from Δ_{B,A} in order to obtain all the words of data block A, and writing these words to the output stream in order, to recover the complete data block A.

2. The fast delta compression method according to claim 1, characterized in that step (1) includes the following sub-steps:

(1-2) computing the fast rolling hash value f of reference block B at the current sliding position i as f = (f << 1) + B_i, where B_i denotes the byte of reference block B at the current sliding position i;

(1-3) determining whether the lowest p bits of the fast rolling hash value f computed in step (1-2) are all 0; if so, going to step (1-4); otherwise going to step (1-5);

(1-4) marking position i as the end of a word, computing a fingerprint for the word using a fingerprint algorithm to build the fingerprint index of the word, setting f = 1, and going to step (1-5);

(1-5) setting i = i + 1 and repeating steps (1-2) and (1-3) until the last byte of reference block B has been processed.

3. The fast delta compression method according to claim 1, characterized in that step (3) is specifically: the '0' data format is used to record the position and length of a duplicate word in reference block B, and the '1' data format is used to record the length and byte content of a non-duplicate word, to obtain the differential data block Δ_{B,A} of data block A relative to reference block B.

4. The fast delta compression method according to claim 1, characterized in that step (4) is specifically: for a record in the '0' data format, the duplicate word is fetched from reference block B according to the position and length information; for a record in the '1' data format, the non-duplicate word is taken directly from the record.