CN105515586A - Rapid delta compression method - Google Patents



Publication number
CN105515586A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510927001.5A
Other languages
Chinese (zh)
Other versions
CN105515586B (en)
Inventor
夏文
冯丹
李春光
江泓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201510927001.5A
Publication of CN105515586A
Application granted
Publication of CN105515586B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091 Data deduplication

Abstract

The invention discloses a rapid delta compression method comprising the following steps: performing content-based rapid segmentation on a reference block B used in delta compression to obtain a plurality of words that form a word library; performing content-based rapid segmentation on a data block A similar to the reference block B, and extending (amplifying) each repeated word detected during segmentation, to obtain repeated words and non-repeated words; encoding and storing the repeated and non-repeated words in segmentation order, using two different data formats to record repeated words and non-repeated words respectively, to obtain a delta data block $\Delta_{B,A}$; and, when $\Delta_{B,A}$ needs to be decoded, reading the records of the two data formats from $\Delta_{B,A}$ in sequence to obtain all words of data block A in order, and writing the words to an output stream in that order to restore the complete data block A. The method offers efficient lookup of repeated words, low computational cost, and high data compression efficiency.

Description

Rapid delta compression method
Technical field
The invention belongs to the field of data compression in computer storage and, more specifically, relates to a rapid delta compression method.
Background art
In recent years, with the development of computer technology and the spread of networks, the volume of data stored worldwide has grown explosively. Although the price of storage devices keeps falling, it cannot keep pace with the growth of data. Data deduplication, a technique that effectively eliminates redundant data at large scale, has therefore become a focus of storage-system research in recent years. Data deduplication not only saves a great deal of storage space and so improves the utilization of storage resources, it also improves the effective use of network bandwidth by avoiding the transmission of redundant data.
However, as data deduplication has developed, it has also run into several challenges. Because traditional deduplication decides whether data is duplicated by comparing fingerprints of data blocks, it can only identify blocks that are exactly identical and cannot identify blocks that are merely very similar. For example, if two data blocks A1 and A2 differ in only a few bytes, they are almost identical, yet deduplication produces completely different fingerprints for them and so misses the redundancy between them. Delta compression was proposed for this situation. It is an efficient data compression technique that compresses a data block $A_i$ against a similar reference data block (also called the base block) $A_r$. The more similar the two blocks are, the higher the compression efficiency. $A_r$ and $A_i$ are fed to a delta encoder, which outputs a delta (denoted $\Delta_{r,i}$) that serves as the compressed version of $A_i$. To decompress $A_i$, the delta $\Delta_{r,i}$ and the reference data block $A_r$ are read, and $A_i$ is computed from them.
Compared with data deduplication, delta compression can therefore also eliminate redundancy among data that is similar but not identical, and so achieves a higher data compression ratio.
However, existing delta compression techniques have the following problems: encoding is slow, indexing overhead is large, compression efficiency is low, and scalability is poor. Take Xdelta, the classic and currently widely used delta compression algorithm from the University of California, Berkeley: its encoding speed is only about 30-60 MB/s (on a quad-core Intel Xeon 2.6 GHz processor), and such slow encoding seriously limits the adoption and development of the algorithm.
Summary of the invention
In view of the above defects of, or needs for improvement in, the prior art, the invention provides a rapid delta compression method. Its aim is to identify the data that differs between similar data blocks (or files) by rapidly segmenting them into words based on content, computing word hashes, and looking up repeated words in an index, and then to store the result in delta-encoded form. This saves storage space and solves the technical problems of existing delta compression techniques: slow encoding, large indexing overhead, low compression efficiency, and poor scalability.
To achieve the above object, according to one aspect of the invention, a rapid delta compression method is provided, comprising the following steps:
(1) performing content-based rapid segmentation on the reference block B used in delta compression to obtain a plurality of words, thereby forming a word library;
(2) performing content-based rapid segmentation on a data block A similar to reference block B, and extending each repeated word detected during segmentation, to obtain repeated words and non-repeated words;
(3) encoding and storing the repeated and non-repeated words obtained in step (2) in segmentation order, using two different data formats to record repeated words and non-repeated words respectively, to obtain a delta data block $\Delta_{B,A}$;
(4) when the delta data block $\Delta_{B,A}$ needs to be decoded, reading the records of the two data formats from $\Delta_{B,A}$ in sequence to obtain all words of data block A in order, and writing these words to an output stream in that order to recover the complete data block A.
Preferably, step (1) comprises the following sub-steps:
(1-1) initialize the fast rolling hash value f = 1, and set the current sliding position of reference block B to i = 1;
(1-2) compute the fast rolling hash value f of reference block B at the current sliding position i as $f = (f \ll 1) + B_i$, where $\ll$ denotes a left shift and $B_i$ is the byte of reference block B at position i;
(1-3) determine whether the lowest p bits of the hash value f computed in step (1-2) are all 0; if so, go to step (1-4); otherwise go to step (1-5);
(1-4) mark position i as the end of a word, compute the fingerprint of this word with a fingerprint algorithm so as to build the fingerprint index entry for the word, set f = 1, and go to step (1-5);
(1-5) set i = i + 1 and repeat steps (1-2) and (1-3) until the last byte of reference block B has been processed.
Preferably, step (2) comprises the following sub-steps:
(2-1) initialize the fast rolling hash value g = 1, and set the current sliding position of data block A to j = 1;
(2-2) compute the fast rolling hash value g of data block A at the current sliding position j as $g = (g \ll 1) + A_j$, where $\ll$ denotes a left shift and $A_j$ is the byte of data block A at position j;
(2-3) determine whether the lowest p bits of the hash value g computed in step (2-2) are all 0; if so, go to step (2-4); otherwise go to step (2-7);
(2-4) mark position j as the end of a word W, and compute the fingerprint h of W with the same fingerprint algorithm as in step (1-4);
(2-5) look up fingerprint h in the word library of reference block B, using the fingerprint index to detect whether W is a repeated word; if so, go to step (2-6); otherwise mark W as a non-repeated word, set g = 1, and go to step (2-7);
(2-6) mark W as a repeated word, and continue comparing the bytes that follow W in data block A with the bytes that follow its matching word V in reference block B, stopping at the first byte that differs; this yields the number k of further matching bytes; increase the length of W by k, set j = j + k and g = 1, and return to step (2-2);
(2-7) set j = j + 1 and repeat steps (2-2) and (2-3) until the last byte of data block A has been processed.
Preferably, step (3) specifically comprises: using the '0' data format to record the position and length of each repeated word in reference block B, and using the '1' data format to record the length and byte content of each non-repeated word, so as to obtain the delta data block $\Delta_{B,A}$ of data block A relative to reference block B.
Preferably, step (4) specifically comprises: for a record in the '0' data format, obtaining the repeated word from reference block B according to the recorded position and length; for a record in the '1' data format, taking the non-repeated word directly from the record.
In general, compared with the prior art, the above technical solution conceived by the invention achieves the following beneficial effects:
1. With the content-based rapid word segmentation method (steps (1-2), (1-3), (2-2), and (2-3)), each slide requires only one left shift and one addition, so words are segmented quickly. The method also avoids the large number of overlapping words that conventional compression coding produces, which simplifies word matching during delta encoding, reduces the associated memory and computational overhead, and still ensures high delta compression efficiency.
2. With the repeated-word extension (amplification) strategy, the content following a detected repeated word is compared directly to see whether it is also repeated. This avoids segmenting, hashing, and looking up the bytes that follow a repeated word, which speeds up delta encoding. Because extension also reduces the number of records in the delta data, decoding is faster as well. In addition, the method preserves the locality of repeated words, which ensures high data compression efficiency.
3. Content-based rapid word segmentation helps repeated-word extension locate repeated words quickly, and repeated-word extension in turn spares delta compression the segmentation, hashing, and other computation for the content that follows. The two complement each other and together yield very fast delta encoding. At the same time, repeated-word extension greedily detects further duplicate content after each repeated word, which ensures high delta compression efficiency.
4. With the proposed merged encoding of non-repeated words, several consecutive non-repeated words are stored as a single record, which speeds up both delta encoding and decoding.
Brief description of the drawings
Fig. 1 is a flow chart of the rapid delta compression encoding method of the invention.
Fig. 2 is a schematic diagram of the fast rolling hash method used by the invention.
Fig. 3 is a schematic diagram comparing the content-based segmentation used by the invention with a conventional compression method.
Fig. 4 is a schematic diagram of the word extension method used by the invention.
Fig. 5 is a schematic diagram of the encoding format used by the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
As shown in Fig. 1, the delta compression of the invention mainly comprises three parts: content-based rapid word segmentation, repeated-word extension (amplification), and delta encoding with merged non-repeated words.
The content-based rapid word segmentation method proposed by the invention avoids the large number of overlapping words produced by conventional compression coding and simplifies word matching during delta encoding. As shown in Fig. 3, a conventional delta compression method generates many overlapping strings in the hope of finding more repeated words. The content-based method proposed here generates fewer words and also avoids the position-shift problem caused by content modification, because identical content always produces the same word cut points. The advantage of the method is that each slide requires only one addition and one left shift, so memory and computational overhead are small, while high delta compression efficiency is still ensured.
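The claim that identical content yields identical cut points follows from a simple property of the update f = (f << 1) + byte: after at least p bytes have been fed in, the lowest p bits of f depend only on the most recent p bytes, because everything older has been shifted past bit p. The following Python sketch checks this numerically. The function names `roll` and `low_bits_after` are illustrative and do not come from the patent, and Python's unbounded integers stand in for whatever fixed-width register a real implementation would use (the low bits are unaffected by that choice).

```python
def roll(f: int, byte: int) -> int:
    """One slide of the hash: one left shift plus one addition."""
    return (f << 1) + byte


def low_bits_after(data: bytes, p: int, f0: int = 1) -> int:
    """Feed all of data through roll() and return the lowest p bits of the result."""
    f = f0
    for b in data:
        f = roll(f, b)
    return f & ((1 << p) - 1)


p = 6
tail = b"delta!"  # exactly p = 6 bytes
x = low_bits_after(b"some unrelated prefix " + tail, p)
y = low_bits_after(b"a different, longer prefix ... " + tail, p)
z = low_bits_after(tail, p, f0=12345)  # even a different starting value
print(x == y == z)  # True: only the last p bytes matter
```

Note that the method resets f to 1 after every cut point, so within the first p bytes after a reset the cut decision still depends on where the reset happened; cut points in two similar blocks therefore realign shortly after a modification rather than instantly.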
The repeated-word extension (amplification) method proposed by the invention compares content directly onward from a detected repeated word, and so avoids segmentation, fingerprint computation, and index operations for the extended region. The method builds on the word locality of data streams. Word locality in a storage system means that if words once appeared in the sequence A, B, C, then the next time word A appears, words B and C are very likely to follow it. The invention exploits this locality for delta encoding. As shown in Fig. 4, take the word sequences of two similar data blocks, $B_1, B_2, B_3, B_4, B_5$ and $A_1, A_2, A_3, A_4, A_5$, and suppose content-based rapid word segmentation has determined that $B_2$ and $A_2$ are repeated. By the locality principle, words $B_3, B_4, B_5$ are very likely to repeat words $A_3, A_4, A_5$ respectively, so the time-consuming segmentation, hashing, and indexing of $B_3, B_4, B_5$ and $A_3, A_4, A_5$ become unnecessary: it suffices to continue the comparison onward from $B_2$ and $A_2$, extending the match to $\{B_2, B_3, B_4, B_5\}$ and $\{A_2, A_3, A_4, A_5\}$. Moreover, repeated-word extension is not bound to the content-based word boundaries; it may stop at a position inside a word, and so finds more duplicate content. The extension method therefore only needs to compare subsequent content byte by byte, which simplifies and accelerates delta encoding while ensuring high data compression efficiency.
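As a rough illustration of the extension step, the sketch below compares bytes forward from the end of a matched word in both blocks and stops at the first mismatch. The helper name `extend_match`, the 0-based offsets, and the sample data are assumptions made for this example only.

```python
def extend_match(a: bytes, a_next: int, b: bytes, b_next: int) -> int:
    """Count how many bytes match going forward from a[a_next] and b[b_next]."""
    k = 0
    while (a_next + k < len(a) and b_next + k < len(b)
           and a[a_next + k] == b[b_next + k]):
        k += 1
    return k


ref = b"header|alpha beta gamma delta|tail-1"
new = b"HEADER|alpha beta gamma delta|TAIL-2"
# Suppose segmentation found the repeated word b"alpha " at offset 7 in both blocks.
start, length = 7, 6
k = extend_match(new, start + length, ref, start + length)
print(k)                                  # 17
print(new[start:start + length + k])      # b'alpha beta gamma delta|'
```

The extended match ends wherever the content diverges, not at a word boundary, and none of the 17 extra bytes had to be hashed or looked up.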
The delta encoding with merged non-repeated words proposed by the invention, shown in Fig. 5, merges consecutive non-repeated words into a single encoded word. This improves delta compression efficiency and also speeds up delta decoding. Combining the techniques above greatly accelerates both the encoding and decoding of delta compression while maintaining data compression efficiency.
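One way to picture the merging of consecutive non-repeated words is as a pass over a list of records. In the sketch below, the tuple shapes `("copy", offset, length)` and `("literal", raw_bytes)` and the function name `merge_literals` are hypothetical; the patent defines the record formats only by reference to Fig. 5.

```python
from typing import List, Tuple, Union

Copy = Tuple[str, int, int]      # ("copy", offset_in_B, length)
Literal = Tuple[str, bytes]      # ("literal", raw_bytes)
Record = Union[Copy, Literal]


def merge_literals(records: List[Record]) -> List[Record]:
    """Collapse each run of adjacent literal records into a single literal."""
    merged: List[Record] = []
    for rec in records:
        if rec[0] == "literal" and merged and merged[-1][0] == "literal":
            merged[-1] = ("literal", merged[-1][1] + rec[1])
        else:
            merged.append(rec)
    return merged


records = [("literal", b"ab"), ("literal", b"cd"), ("copy", 0, 8),
           ("literal", b"ef"), ("literal", b"gh"), ("literal", b"ij")]
print(merge_literals(records))
# [('literal', b'abcd'), ('copy', 0, 8), ('literal', b'efghij')]
```

Six records shrink to three, so fewer record headers are written during encoding and fewer are parsed during decoding.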
As shown in Fig. 1, the rapid delta compression method of the invention performs rapid delta encoding of a known data block (or file) A against a similar reference data block B, and comprises the following specific steps:
(1) Perform content-based rapid segmentation (FastRolling) on the reference block B used in delta compression to obtain a plurality of words, thereby forming a word library. This step comprises the following sub-steps:
(1-1) initialize the fast rolling hash value f = 1, and set the current sliding position of reference block B to i = 1;
(1-2) compute the fast rolling hash value f of reference block B at the current sliding position i as $f = (f \ll 1) + B_i$, where $\ll$ denotes a left shift and $B_i$ is the byte of reference block B at position i, as shown in Fig. 2;
With one left shift and one addition, this hash algorithm provides the function of a rolling hash (the current hash value is obtained from the previous one: old bytes are gradually shifted out of the hash value by the left shifts, and each new byte enters it through the addition), while also being fast to compute and producing well-randomized hash values.
(1-3) determine whether the lowest p bits of the hash value f computed in step (1-2) are all 0, that is, whether $f \mathbin{\&} (2^p - 1) = 0$. The value of p is chosen so that the average length of the words obtained by segmenting reference block B is approximately $2^p$ bytes; for example, for an average word length of about 64 bytes, p = 6. In general $4 \le p \le 13$, so the average word length ranges from 16 bytes to 8192 bytes. If the condition holds, go to step (1-4); otherwise go to step (1-5);
Here the content-based rapid word segmentation method guarantees that identical content produces identical word cut points. The effect of delta compression can be tuned through the average word length (by setting p to different values). In general, the larger the average word length, the coarser the granularity of delta compression, the faster the computation, and the lower the compression ratio; the smaller the average word length, the finer the granularity, the slower the computation, and the higher the compression ratio.
As shown in Fig. 3, content-based rapid word segmentation produces only a small number of words for repeated-word lookup, avoiding the large number of overlapping words produced by conventional compression coding and simplifying word matching during delta encoding.
(1-4) mark position i as the end of a word, compute the fingerprint of this word with a fingerprint algorithm (xxHash in this embodiment) so as to build the fingerprint index entry for the word, set f = 1, and go to step (1-5);
(1-5) set i = i + 1 and repeat steps (1-2) and (1-3) until the last byte of reference block B has been processed.
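Sub-steps (1-1) to (1-5) can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the patented implementation: indices are 0-based, the hash is kept in an assumed 64-bit register, `hashlib.blake2b` stands in for xxHash (which is not in the Python standard library), bytes left over after the last cut point are treated as a final word, and the names `chunk`, `fingerprint`, and `build_library` are invented for the example.

```python
import hashlib
from typing import Dict, List, Tuple

MASK64 = (1 << 64) - 1  # assumed register width; does not affect the low p bits


def fingerprint(word: bytes) -> bytes:
    # Stand-in for xxHash (used in the embodiment); any fast hash works here.
    return hashlib.blake2b(word, digest_size=8).digest()


def chunk(data: bytes, p: int) -> List[Tuple[int, int]]:
    """Split data into words; return (start, length) pairs in order."""
    mask = (1 << p) - 1
    words, start, f = [], 0, 1                      # (1-1): f = 1
    for i, byte in enumerate(data):
        f = ((f << 1) + byte) & MASK64              # (1-2): one shift, one add
        if f & mask == 0:                           # (1-3): lowest p bits all zero
            words.append((start, i + 1 - start))    # (1-4): word ends at i
            start, f = i + 1, 1
    if start < len(data):                           # assumption: tail is a word too
        words.append((start, len(data) - start))
    return words


def build_library(ref: bytes, p: int) -> Dict[bytes, Tuple[int, int]]:
    """Map fingerprint -> (offset, length) of a word in the reference block."""
    library: Dict[bytes, Tuple[int, int]] = {}
    for start, length in chunk(ref, p):
        library.setdefault(fingerprint(ref[start:start + length]), (start, length))
    return library


ref = b"The quick brown fox" + bytes(4) + b" jumps over the lazy dog"
print(chunk(ref, 4))          # word boundaries found in the reference block
print(len(build_library(ref, 4)))
```

With p = 4 a word is expected to end roughly every 16 bytes on random-looking data; structured input such as text can give quite different lengths.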
(2) Perform content-based rapid segmentation on a data block A similar to reference block B, and extend each repeated word detected during segmentation, to obtain repeated words and non-repeated words. This step comprises the following sub-steps:
(2-1) initialize the fast rolling hash value g = 1, and set the current sliding position of data block A to j = 1;
(2-2) compute the fast rolling hash value g of data block A at the current sliding position j as $g = (g \ll 1) + A_j$, where $\ll$ denotes a left shift and $A_j$ is the byte of data block A at position j;
(2-3) determine whether the lowest p bits of the hash value g computed in step (2-2) are all 0, that is, whether $g \mathbin{\&} (2^p - 1) = 0$, with the same value of p as in step (1-3); if so, go to step (2-4); otherwise go to step (2-7);
(2-4) mark position j as the end of a word W, and compute the fingerprint h of W with the same fingerprint algorithm as in step (1-4) (xxHash in this embodiment);
(2-5) look up fingerprint h in the word library of reference block B, using the fingerprint index to detect whether W is a repeated word; if so (that is, h exists in the fingerprint index and the corresponding word V in reference block B also matches word W byte for byte), go to step (2-6); otherwise mark W as a non-repeated word, set g = 1, and go to step (2-7);
(2-6) mark W as a repeated word, and continue comparing the bytes that follow W with the bytes that follow V, stopping at the first byte that differs (as shown in Fig. 4). This yields the number k of further matching bytes; increase the length of W by k, set j = j + k and g = 1, and return to step (2-2);
Continuing to compare the bytes that follow the repeated words W and V is the repeated-word extension method proposed by the invention. The extended region needs no time-consuming content-based segmentation, fingerprint computation, or fingerprint indexing, which saves time.
(2-7) set j = j + 1 and repeat steps (2-2) and (2-3) until the last byte of data block A has been processed.
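Sub-steps (2-1) to (2-7) can be sketched as a single scan over data block A. As before, this is only an illustration under stated assumptions: 0-based indices, an assumed 64-bit hash register, `hashlib.blake2b` as a stand-in for xxHash, records represented as hypothetical `("copy", offset, length)` and `("literal", bytes)` tuples, and any bytes after the last cut point of A emitted as a literal. For brevity the reference index here skips the unterminated tail of B.

```python
import hashlib
from typing import Dict, List, Tuple

MASK64 = (1 << 64) - 1  # assumed register width


def fp(word: bytes) -> bytes:
    return hashlib.blake2b(word, digest_size=8).digest()  # stand-in for xxHash


def index_reference(ref: bytes, p: int) -> Dict[bytes, int]:
    """Step (1) in brief: fingerprint -> start offset of each word of B."""
    index: Dict[bytes, int] = {}
    start, f = 0, 1
    for i, byte in enumerate(ref):
        f = ((f << 1) + byte) & MASK64
        if f & ((1 << p) - 1) == 0:
            index.setdefault(fp(ref[start:i + 1]), start)
            start, f = i + 1, 1
    return index


def encode(a: bytes, ref: bytes, p: int) -> List[Tuple]:
    """Step (2): return ("copy", offset, length) / ("literal", bytes) records."""
    index, mask = index_reference(ref, p), (1 << p) - 1
    out: List[Tuple] = []
    start, g, j = 0, 1, 0                                  # (2-1)
    while j < len(a):
        g = ((g << 1) + a[j]) & MASK64                     # (2-2)
        j += 1
        if g & mask:                                       # (2-3) no cut point yet
            continue
        word = a[start:j]                                  # (2-4) word W ends here
        off = index.get(fp(word))                          # (2-5) fingerprint lookup
        if off is not None and ref[off:off + len(word)] == word:
            k, r = 0, off + len(word)                      # (2-6) extend the match
            while j + k < len(a) and r + k < len(ref) and a[j + k] == ref[r + k]:
                k += 1
            out.append(("copy", off, len(word) + k))
            j += k                                         # skip the extended region
        else:
            out.append(("literal", word))
        start, g = j, 1
    if start < len(a):                                     # unterminated tail of A
        out.append(("literal", a[start:]))
    return out


ref = (b"The quick brown fox" + bytes(4) + b" jumps over the lazy dog"
       + bytes(4) + b" and keeps running.")
a = ref[:-8] + b"walking."
ops = encode(a, ref, 4)
print(ops[0])   # the first record should be a copy covering the shared prefix
```

In this example the two blocks share everything except the last eight bytes, so after the first word is matched the extension loop carries the copy record across the rest of the shared prefix without hashing it.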
(3) Encode and store the repeated and non-repeated words obtained in step (2) in segmentation order, using two different data formats to record repeated words and non-repeated words respectively, to obtain the delta data block $\Delta_{B,A}$. Specifically, as shown in Fig. 5, '0' and '1' denote the two data formats: the '0' format records the position and length of a repeated word in reference block B, and the '1' format records the length and byte content of a non-repeated word, thereby yielding the delta data block of data block A relative to reference block B (denoted $\Delta_{B,A}$).
(4) When the delta data block $\Delta_{B,A}$ needs to be decoded, read the records of the two data formats from $\Delta_{B,A}$ in sequence to obtain all words of data block A in order, and write these words to an output stream in that order to recover the complete data block A. Specifically, as shown in Fig. 5, for a record in the '0' data format, the repeated word is obtained from reference block B according to the recorded position and length; for a record in the '1' data format, the non-repeated word is taken directly from the record.
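Steps (3) and (4) can be illustrated with a small serializer and decoder. The byte layout below is an assumption chosen for the example, since the text describes the formats only through Fig. 5: a one-byte tag (0 or 1), then for tag 0 a 4-byte big-endian offset and 4-byte length, and for tag 1 a 4-byte length followed by the raw bytes. The function names `serialize` and `decode` are likewise illustrative.

```python
import struct
from typing import Iterable, Tuple


def serialize(records: Iterable[Tuple]) -> bytes:
    """Pack records into a delta block: tag 0 = repeated, tag 1 = non-repeated."""
    out = bytearray()
    for rec in records:
        if rec[0] == "copy":
            out += struct.pack(">BII", 0, rec[1], rec[2])        # tag, offset, length
        else:
            out += struct.pack(">BI", 1, len(rec[1])) + rec[1]   # tag, length, bytes
    return bytes(out)


def decode(delta: bytes, ref: bytes) -> bytes:
    """Step (4): rebuild data block A from the delta block and reference block B."""
    out, pos = bytearray(), 0
    while pos < len(delta):
        tag = delta[pos]
        if tag == 0:                                   # repeated word: copy from B
            offset, length = struct.unpack_from(">II", delta, pos + 1)
            out += ref[offset:offset + length]
            pos += 9
        elif tag == 1:                                 # non-repeated word: inline bytes
            (length,) = struct.unpack_from(">I", delta, pos + 1)
            out += delta[pos + 5:pos + 5 + length]
            pos += 5 + length
        else:
            raise ValueError(f"unknown record tag {tag}")
    return bytes(out)


ref = b"hello delta world"
delta = serialize([("copy", 0, 6), ("literal", b"rapid "), ("copy", 6, 11)])
print(decode(delta, ref))   # b'hello rapid delta world'
```

Decoding needs no hashing or index at all: it is a linear pass that either copies a range of B or copies bytes out of the delta itself, which is why fewer, longer records make it faster.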
In summary, the invention accelerates delta encoding and offers efficient lookup of repeated words, low computational overhead, and high data compression efficiency.
Those skilled in the art will readily understand that the above are merely preferred embodiments of the invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (5)

1. A rapid delta compression method, comprising the following steps:
(1) performing content-based rapid segmentation on the reference block B used in delta compression to obtain a plurality of words, thereby forming a word library;
(2) performing content-based rapid segmentation on a data block A similar to reference block B, and extending each repeated word detected during segmentation, to obtain repeated words and non-repeated words;
(3) encoding and storing the repeated and non-repeated words obtained in step (2) in segmentation order, using two different data formats to record repeated words and non-repeated words respectively, to obtain a delta data block $\Delta_{B,A}$;
(4) when the delta data block $\Delta_{B,A}$ needs to be decoded, reading the records of the two data formats from $\Delta_{B,A}$ in sequence to obtain all words of data block A in order, and writing these words to an output stream in that order to recover the complete data block A.
2. The rapid delta compression method according to claim 1, characterized in that step (1) comprises the following sub-steps:
(1-1) initialize the fast rolling hash value f = 1, and set the current sliding position of reference block B to i = 1;
(1-2) compute the fast rolling hash value f of reference block B at the current sliding position i as $f = (f \ll 1) + B_i$, where $\ll$ denotes a left shift and $B_i$ is the byte of reference block B at position i;
(1-3) determine whether the lowest p bits of the hash value f computed in step (1-2) are all 0; if so, go to step (1-4); otherwise go to step (1-5);
(1-4) mark position i as the end of a word, compute the fingerprint of this word with a fingerprint algorithm so as to build the fingerprint index entry for the word, set f = 1, and go to step (1-5);
(1-5) set i = i + 1 and repeat steps (1-2) and (1-3) until the last byte of reference block B has been processed.
3. The rapid delta compression method according to claim 1, characterized in that step (2) comprises the following sub-steps:
(2-1) initialize the fast rolling hash value g = 1, and set the current sliding position of data block A to j = 1;
(2-2) compute the fast rolling hash value g of data block A at the current sliding position j as $g = (g \ll 1) + A_j$, where $\ll$ denotes a left shift and $A_j$ is the byte of data block A at position j;
(2-3) determine whether the lowest p bits of the hash value g computed in step (2-2) are all 0; if so, go to step (2-4); otherwise go to step (2-7);
(2-4) mark position j as the end of a word W, and compute the fingerprint h of W with the same fingerprint algorithm as in step (1-4);
(2-5) look up fingerprint h in the word library of reference block B, using the fingerprint index to detect whether W is a repeated word; if so, go to step (2-6); otherwise mark W as a non-repeated word, set g = 1, and go to step (2-7);
(2-6) mark W as a repeated word, and continue comparing the bytes that follow W in data block A with the bytes that follow its matching word V in reference block B, stopping at the first byte that differs; this yields the number k of further matching bytes; increase the length of W by k, set j = j + k and g = 1, and return to step (2-2);
(2-7) set j = j + 1 and repeat steps (2-2) and (2-3) until the last byte of data block A has been processed.
4. The rapid delta compression method according to claim 1, characterized in that step (3) specifically comprises: using the '0' data format to record the position and length of each repeated word in reference block B, and using the '1' data format to record the length and byte content of each non-repeated word, so as to obtain the delta data block $\Delta_{B,A}$ of data block A relative to reference block B.
5. The rapid delta compression method according to claim 1, characterized in that step (4) specifically comprises: for a record in the '0' data format, obtaining the repeated word from reference block B according to the recorded position and length; for a record in the '1' data format, taking the non-repeated word directly from the record.
CN201510927001.5A 2015-12-14 2015-12-14 Rapid delta compression method Active CN105515586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510927001.5A CN105515586B (en) 2015-12-14 2015-12-14 Rapid delta compression method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510927001.5A CN105515586B (en) 2015-12-14 2015-12-14 Rapid delta compression method

Publications (2)

Publication Number Publication Date
CN105515586A true CN105515586A (en) 2016-04-20
CN105515586B CN105515586B (en) 2019-04-12

Family

ID=55723301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510927001.5A Active CN105515586B (en) 2015-12-14 2015-12-14 Rapid delta compression method

Country Status (1)

Country Link
CN (1) CN105515586B (en)


* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130198150A1 (en) * 2012-01-30 2013-08-01 Samsung Electronics Co., Ltd. File-type dependent data deduplication
CN102684827A (en) * 2012-03-02 2012-09-19 华为技术有限公司 Data processing method and data processing equipment
CN102831222A (en) * 2012-08-24 2012-12-19 华中科技大学 Differential compression method based on data de-duplication

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
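Xia Wen: "Research on High-Performance Elimination of Redundant Data in Data Backup Systems", China Doctoral Dissertations Full-text Database (Information Science and Technology) *
Xie Chuiyi et al.: "Duplicate Data Detection Algorithm Based on Extreme-Point Chunking", 《技术研究》 (Technology Research) *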

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844479A (en) * 2016-12-23 2017-06-13 光锐恒宇(北京)科技有限公司 Method and device for compressing and decompressing files
CN106844479B (en) * 2016-12-23 2020-07-07 光锐恒宇(北京)科技有限公司 Method and device for compressing and decompressing file
CN108268628A (en) * 2018-01-15 2018-07-10 深圳前海信息技术有限公司 Method and device for delta compression based on dynamic anchor points
CN110083743A (en) * 2019-03-28 2019-08-02 哈尔滨工业大学(深圳) Rapid similar data detection method based on uniform sampling
CN110083743B (en) * 2019-03-28 2021-11-16 哈尔滨工业大学(深圳) Rapid similar data detection method based on unified sampling
CN111796969A (en) * 2020-05-29 2020-10-20 湖北工业大学 Data difference compression detection method, computer equipment and storage medium

Also Published As

Publication number Publication date
CN105515586B (en) 2019-04-12

Similar Documents

Publication Publication Date Title
US11243915B2 (en) Method and apparatus for data deduplication
CN102831222B (en) Differential compression method based on data de-duplication
US8543555B2 (en) Dictionary for data deduplication
US20180196609A1 (en) Data Deduplication Using Multi-Chunk Predictive Encoding
US8174412B2 (en) Combined hash for variable data chunks
JP7122325B2 (en) Lossless reduction of data using the base data sieve and performing multidimensional search and content-associative retrieval on the losslessly reduced data using the base data sieve
CN102323958A (en) Data de-duplication method
US9183218B1 (en) Method and system to improve deduplication of structured datasets using hybrid chunking and block header removal
US20120124011A1 (en) Method for increasing deduplication speed on data streams fragmented by shuffling
AU2010200866B1 (en) Data reduction indexing
US20120136842A1 (en) Partitioning method of data blocks
CN105515586A (en) Rapid delta compression method
CN103152430B (en) Cloud storage method for reducing the storage space occupied by data
US20170344579A1 (en) Data deduplication
CN102999433A (en) Redundant data deletion method and system of virtual disks
CN110569245A (en) Fingerprint index prefetching method based on reinforcement learning in data de-duplication system
CN108475508B (en) Simplification of audio data and data stored in block processing storage system
JP6726690B2 (en) Performing multidimensional search, content-associative retrieval, and keyword-based retrieval and retrieval on losslessly reduced data using basic data sieves
JP2023525791A (en) Exploiting Base Data Locality for Efficient Retrieval of Lossless Reduced Data Using Base Data Sieves
CN106980680B (en) Data storage method and storage device
CN103678158A (en) Optimization method and system for data layout
CN113672170A (en) Redundant data marking and removing method
US10564848B2 (en) Information storage device and method for deduplication
US11347423B2 (en) System and method for detecting deduplication opportunities
CN114327252A (en) Data reduction in block-based storage systems using content-based block alignment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant