CN105515586B - A fast delta compression method

Info

Publication number
CN105515586B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510927001.5A
Other languages
Chinese (zh)
Other versions
CN105515586A (en)
Inventor
夏文
冯丹
李春光
江泓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201510927001.5A
Publication of CN105515586A
Application granted
Publication of CN105515586B
Active legal status
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091 Data deduplication

Abstract

The invention discloses a fast delta compression method, comprising: performing fast content-based cutting on the reference block B used in delta compression to obtain multiple words, which form a word dictionary; performing fast content-based cutting on a data block A that is similar to reference block B, and extending each duplicate word detected during the cutting, to obtain duplicate words and non-duplicate words; encoding and storing the duplicate words and non-duplicate words in cutting order, with duplicate words and non-duplicate words recorded in two different data formats, to obtain the delta data block Δ_{B,A}; and, when Δ_{B,A} needs to be decoded, reading the records of the two data formats from Δ_{B,A} in order to obtain all words of data block A, and writing these words sequentially to the output stream to restore the complete data block A. The invention offers fast duplicate-word lookup, low computational overhead and high data compression efficiency.

Description

A fast delta compression method
Technical Field
The invention belongs to the field of data compression in computer storage and, more particularly, relates to a fast delta compression method.
Background
In recent years, with the development of computer technology and the spread of networks, the volume of stored data worldwide has grown explosively. Although the price of storage devices keeps falling, it cannot keep pace with the growth of data. Data deduplication, a technique that effectively eliminates redundant data at large scale, has therefore become a hot topic in storage system research. Deduplication not only saves storage space, which greatly improves the utilization of storage resources, but also improves the efficiency of network bandwidth by avoiding the transmission of redundant data.
As deduplication has developed, however, it has run into several challenges. Traditional deduplication decides whether data is duplicate from the fingerprints of data blocks, so it can identify only exactly identical blocks and cannot identify blocks that are merely very similar. For example, if two data blocks A1 and A2 differ in only a few bytes, deduplication produces completely different fingerprints for them even though they are almost identical, and the redundancy between these similar blocks goes unprocessed. Delta compression was proposed for this situation. Delta compression is an efficient data compression technique that compresses a similar data block A_i heavily with respect to a reference data block (also called the base block) A_r. The more similar the blocks, the higher the compression efficiency. A_r and A_i are input to the delta encoder, which outputs delta data (denoted Δ_{r,i}) that represents the compressed version of A_i. To decompress A_i, the delta data Δ_{r,i} and the reference block A_r are read, and A_i is computed from them.
Compared with deduplication, delta compression can therefore eliminate redundant data that is not duplicate but similar, and so achieves a higher data compression ratio.
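The limitation of fingerprint-based deduplication described above can be illustrated with a minimal Python sketch. The block contents and the choice of SHA-1 as the fingerprint are assumptions made for illustration only; the text does not name the fingerprint used by deduplication systems.

```python
import hashlib

# Two hypothetical 4 KiB blocks that differ in exactly one byte.
block_a1 = bytes(range(256)) * 16
changed = bytearray(block_a1)
changed[100] ^= 0xFF                      # flip every bit of one byte
block_a2 = bytes(changed)

# SHA-1 stands in for the block fingerprint (an assumption).
fp_a1 = hashlib.sha1(block_a1).hexdigest()
fp_a2 = hashlib.sha1(block_a2).hexdigest()

# Fingerprint comparison treats the two blocks as unrelated ...
fingerprints_match = fp_a1 == fp_a2
# ... although only a single byte position actually differs.
differing_bytes = sum(x != y for x, y in zip(block_a1, block_a2))

print(fingerprints_match, differing_bytes)
```

A delta encoder would store the second block as a reference to the first plus the one changed byte, which is the redundancy that fingerprint matching alone cannot exploit.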
However, existing delta compression techniques have the following problems: compression encoding is slow, index overhead is high, data compression efficiency is low, and scalability is poor. Take the classic and widely used Xdelta delta compression algorithm from the University of California, Berkeley as an example: its encoding rate is only about 30-60 MB/s (on a quad-core Intel Xeon 2.6 GHz processor), and such a slow encoding rate seriously limits the adoption and development of the algorithm.
Summary of the Invention
In view of the above defects of, or needs for improvement in, the prior art, the invention provides a fast delta compression method. Its purpose is to identify the data that differs between similar data blocks (or files) through operations such as fast content-based word cutting, word hash calculation and index lookup of duplicate words, and finally to store the data in delta-encoded form. This saves storage space and solves the technical problems of existing delta compression techniques: slow compression encoding, high index overhead, low data compression efficiency and poor scalability.
To achieve the above purpose, according to one aspect of the invention, a fast delta compression method is provided, comprising the following steps:
(1) perform fast content-based cutting on the reference block B used in delta compression to obtain multiple words, which form a word dictionary;
(2) perform fast content-based cutting on a data block A that is similar to reference block B, and extend each duplicate word detected during the cutting, to obtain duplicate words and non-duplicate words;
(3) encode and store the duplicate words and non-duplicate words obtained in step (2) in cutting order, recording duplicate words and non-duplicate words in two different data formats, to obtain the delta data block Δ_{B,A};
(4) when the delta data block Δ_{B,A} needs to be decoded, read the records of the two data formats from Δ_{B,A} in order to obtain all words of data block A, and write these words sequentially to the output stream to restore the complete data block A.
Preferably, step (1) comprises the following sub-steps:
(1-1) initialize the fast sliding hash value f = 1 and set the current sliding position of reference block B to i = 1;
(1-2) compute the fast sliding hash value f of reference block B at the current sliding position i as f = (f << 1) + B_i, where B_i denotes the byte of reference block B at position i;
(1-3) determine whether the lowest p bits of the hash value f computed in step (1-2) are all 0; if so, go to step (1-4); otherwise go to step (1-5);
(1-4) mark position i as the end of a word, compute a fingerprint of the word with a fingerprint algorithm to build the fingerprint index entry for the word, set f = 1, and go to step (1-5);
(1-5) set i = i + 1 and repeat steps (1-2) and (1-3) until the last byte of reference block B has been processed.
Preferably, step (2) comprises the following sub-steps:
(2-1) initialize the fast sliding hash value g = 1 and set the current sliding position of data block A to j = 1;
(2-2) compute the fast sliding hash value g of data block A at the current position j as g = (g << 1) + A_j, where A_j denotes the byte of data block A at position j;
(2-3) determine whether the lowest p bits of the hash value g computed in step (2-2) are all 0; if so, go to step (2-4); otherwise go to step (2-7);
(2-4) mark position j as the end of a word W, and compute the fingerprint h of word W with the same fingerprint algorithm as in step (1-4);
(2-5) look up the word dictionary of reference block B by fingerprint h, using the fingerprint index to detect whether W is a duplicate word; if so, go to step (2-6); otherwise mark W as a non-duplicate word, set g = 1, and go to step (2-7);
(2-6) mark W as a duplicate word, and continue comparing the bytes that follow W in data block A with the bytes that follow its duplicate V in reference block B, stopping at the first differing byte; with k the number of identical following bytes thus obtained, increase the length of W by k, set j = j + k and g = 1, and return to step (2-2);
(2-7) set j = j + 1 and repeat steps (2-2) and (2-3) until the last byte of data block A has been processed.
Preferably, step (3) is specifically: use the '0' data format to record the position and length of each duplicate word in reference block B, and use the '1' data format to record the length and byte content of each non-duplicate word, to obtain the delta data block Δ_{B,A} of data block A with respect to reference block B.
Preferably, step (4) is specifically: for a record in the '0' data format, obtain the duplicate word from reference block B according to its position and length; for a record in the '1' data format, take the non-duplicate word directly from the record.
In general, compared with the prior art, the above technical solution conceived by the invention achieves the following beneficial effects:
1. With the fast content-based word cutting method (steps (1-2), (1-3), (2-2) and (2-3)), each slide needs only one left shift and one addition, which makes word cutting fast. It also avoids the large number of words with overlapping content that conventional compression encoding generates, which simplifies word matching during delta encoding, reduces the corresponding memory and computational overhead, and still ensures high delta compression efficiency.
2. With the duplicate-word extension strategy, the content following a detected duplicate word is compared directly to see whether it is also duplicate. This avoids cutting, hashing and index lookup for the bytes that follow the duplicate word and speeds up delta encoding. Because extension also reduces the number of records in the delta data, it speeds up decoding as well. In addition, the method preserves the locality of duplicate words, which ensures high data compression efficiency.
3. The fast content-based word cutting method helps the extension method locate duplicate words quickly, and the extension method in turn spares delta compression the cutting and hashing of the following content. The two complement each other and together yield a very fast delta encoding speed. At the same time, the extension method greedily detects as much duplicate content after a duplicate word as possible, which ensures high delta compression efficiency.
4. The invention encodes non-duplicate words in merged form: multiple consecutive non-duplicate words are stored in a single record, which speeds up both delta encoding and decoding.
Brief Description of the Drawings
Fig. 1 is a flow chart of the fast delta compression encoding method of the invention.
Fig. 2 is a schematic diagram of the fast sliding hash method used by the invention.
Fig. 3 is a schematic diagram comparing the content-based cutting used by the invention with conventional compression methods.
Fig. 4 is a schematic diagram of the word extension method used by the invention.
Fig. 5 is a schematic diagram of the encoding format used by the invention.
Detailed Description
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it. In addition, the technical features of the embodiments described below can be combined with one another as long as they do not conflict.
As shown in Fig. 1, the delta compression of the invention consists of three main parts: fast content-based word cutting, duplicate-word extension, and delta encoding that merges non-duplicate words.
The fast content-based word cutting method proposed by the invention avoids the large number of words with overlapping content that conventional compression encoding generates, and simplifies word matching during delta encoding. As shown in Fig. 3, a traditional delta compression method generates many overlapping strings in the hope of finding more duplicate words. The proposed method generates far fewer words while still avoiding the position-shift problem caused by content modifications: identical content is guaranteed to produce the same word cut points. The advantage of the method is that each slide needs only one addition and one left shift, so memory and computational overhead are small while delta compression efficiency remains high.
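The cut-point guarantee can be checked with a minimal Python sketch. Because every step shifts the hash left by one bit, a byte stops influencing the lowest p bits after p further steps, so those bits depend only on the last p bytes. The helper names, the sample bytes and the 64-bit mask are assumptions made for illustration; the text does not fix a register width.

```python
MASK64 = (1 << 64) - 1   # assumed register width, not specified in the text


def roll(f: int, byte: int) -> int:
    """One sliding step of the hash: f = (f << 1) + byte."""
    return ((f << 1) + byte) & MASK64


def low_bits_after(data: bytes, p: int, start: int = 1) -> int:
    """Slide over `data` from the initial value and return the lowest p bits."""
    f = start
    for b in data:
        f = roll(f, b)
    return f & ((1 << p) - 1)


p = 6
tail = bytes([17, 200, 3, 99, 250, 42])            # the last p bytes
prefix_x = b"some earlier content"
prefix_y = b"completely different, longer earlier content"

# Different histories, identical last p bytes: the lowest p bits agree,
# so the cut-point test gives the same answer in both streams.
same = low_bits_after(prefix_x + tail, p) == low_bits_after(prefix_y + tail, p)
print(same)   # → True
```

This is why an insertion or deletion earlier in a block shifts later cut points along with the content instead of changing them.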
The duplicate-word extension method proposed by the invention matches content directly after a detected duplicate word, which avoids cutting, fingerprint calculation and index operations for the extended region. The method builds on the word locality of data streams. Word locality in a storage system means that once words have appeared in the order A, B, C, then the next time word A appears, words B and C are very likely to follow. The invention exploits this locality for delta encoding. As shown in Fig. 4, take the word sequences of two similar data blocks, B_1, B_2, B_3, B_4, B_5 and A_1, A_2, A_3, A_4, A_5, and suppose fast content-based word cutting has determined that B_2 and A_2 are duplicates. By the locality principle, words B_3, B_4 and B_5 most likely duplicate A_3, A_4 and A_5 respectively, so the time-consuming cutting, hashing and indexing of B_3, B_4, B_5 and A_3, A_4, A_5 is unnecessary: it suffices to compare bytes onward from B_2 and A_2 and extend the match to {B_2, B_3, B_4, B_5} and {A_2, A_3, A_4, A_5}. Moreover, extension is not bound to content-defined word boundaries; it may stop in the middle of a word, which allows more duplicate content to be found. Because extension only compares the following content, it simplifies and speeds up delta encoding while ensuring high data compression efficiency.
In the delta encoding that merges non-duplicate words, shown in Fig. 5, consecutive non-duplicate words are merged and encoded as a single word, which improves delta compression efficiency and also speeds up delta decoding. Together, these techniques markedly speed up delta encoding and decoding while preserving data compression efficiency.
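Merging consecutive non-duplicate words can be sketched in a few lines of Python. The tuple layout (kind, start, length) and the labels "dup" and "new" are hypothetical simplifications chosen for this example; they are not the record layout of Fig. 5.

```python
def merge_new_words(words):
    """Merge runs of adjacent non-duplicate words into single entries.

    Each word is a (kind, start, length) tuple, where kind is "dup" or "new"
    and start is the word's offset in data block A.
    """
    merged = []
    for kind, start, length in words:
        if kind == "new" and merged and merged[-1][0] == "new":
            # Adjacent in cutting order means contiguous in A: grow the run.
            _, prev_start, prev_len = merged[-1]
            merged[-1] = ("new", prev_start, prev_len + length)
        else:
            merged.append((kind, start, length))
    return merged


words = [("dup", 0, 40), ("new", 40, 10), ("new", 50, 7),
         ("new", 57, 20), ("dup", 77, 64), ("new", 141, 5)]
print(merge_new_words(words))
# → [('dup', 0, 40), ('new', 40, 37), ('dup', 77, 64), ('new', 141, 5)]
```

Three records collapse into one here, so both the encoder and the decoder handle fewer records for the same bytes.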
As shown in Fig. 1, the fast delta compression method of the invention performs fast delta encoding of a known data block (or file) A against a similar reference data block B, and comprises the following specific steps:
(1) perform fast content-based cutting (FastRolling) on the reference block B used in delta compression to obtain multiple words, which form a word dictionary; this step comprises the following sub-steps:
(1-1) initialize the fast sliding hash value f = 1 and set the current sliding position of reference block B to i = 1;
(1-2) compute the fast sliding hash value f of reference block B at the current sliding position i as f = (f << 1) + B_i, where B_i denotes the byte of reference block B at position i, as shown in Fig. 2;
With one left shift and one addition, this hash algorithm provides a sliding hash (each hash value is derived from the previous one; old bytes are gradually shifted out of the hash value by the left shifts, and each new byte enters the hash value through the addition), and it is also fast to compute and yields strongly random hash values.
(1-3) determine whether the lowest p bits of the hash value f computed in step (1-2) are all 0, that is, whether f & (2^p - 1) equals 0. The value of p is chosen so that the average length of the words obtained by cutting reference block B is approximately 2^p bytes; for example, p = 6 for an average word length of about 64 bytes. Typically 4 ≤ p ≤ 13, which keeps the average word length between 16 bytes and 8192 bytes. If the lowest p bits are all 0, go to step (1-4); otherwise go to step (1-5);
The fast content-based word cutting method guarantees that identical content produces identical word cut points. The effect of delta compression can be tuned through the average word length (by setting p to different values). In general, the larger the average word length, the coarser the granularity of delta processing, the faster the computation and the lower the compression ratio; the smaller the average word length, the finer the granularity, the slower the computation and the higher the compression ratio.
As shown in Fig. 3, fast content-based word cutting produces only a small number of words for duplicate-word lookup, avoiding the large number of words with overlapping content that conventional compression encoding generates and simplifying word matching during delta encoding.
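The relationship between p and the average word length can be explored with a small Python experiment on pseudo-random bytes. This is only a sketch under stated assumptions: the function names, the 64-bit mask, the seeded test data and the choice to count the trailing bytes after the last cut point as a final word are all illustrative and not taken from the text.

```python
import random

MASK64 = (1 << 64) - 1   # assumed register width


def cut_lengths(data: bytes, p: int) -> list:
    """Cut `data` wherever the lowest p bits of the sliding hash are all 0
    and return the lengths of the resulting words."""
    mask = (1 << p) - 1
    lengths, f, start = [], 1, 0
    for i, b in enumerate(data):
        f = ((f << 1) + b) & MASK64
        if f & mask == 0:                 # cut point: the word ends at byte i
            lengths.append(i + 1 - start)
            start, f = i + 1, 1           # reset the hash for the next word
    if start < len(data):                 # trailing bytes without a cut point
        lengths.append(len(data) - start)
    return lengths


def average_word_length(data: bytes, p: int) -> float:
    lengths = cut_lengths(data, p)
    return sum(lengths) / len(lengths)


rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(1 << 16))
for p in (4, 6, 8):
    print(p, len(cut_lengths(data, p)), round(average_word_length(data, p), 1))
```

On random input the printed averages should grow roughly in proportion to 2^p; real data is less uniform, so the figures there can deviate.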
(1-4) mark position i as the end of a word, compute a fingerprint of the word with a fingerprint algorithm (xxHash in this embodiment) to build the fingerprint index entry for the word, set f = 1, and go to step (1-5);
(1-5) set i = i + 1 and repeat steps (1-2) and (1-3) until the last byte of reference block B has been processed.
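Sub-steps (1-1) to (1-5) can be sketched in Python as follows. Several choices here are assumptions rather than part of the described method: positions are 0-based, the hash is masked to 64 bits, a truncated BLAKE2b digest from the standard library stands in for xxHash, the index keeps the first occurrence of each fingerprint, and bytes after the last cut point are left unindexed.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1   # assumed register width


def fingerprint(word: bytes) -> bytes:
    """Stand-in fingerprint; the embodiment names xxHash instead."""
    return hashlib.blake2b(word, digest_size=8).digest()


def build_word_index(ref: bytes, p: int = 6) -> dict:
    """Cut reference block `ref` into words and map each word's
    fingerprint to its (offset, length) inside `ref`."""
    mask = (1 << p) - 1
    index, f, start = {}, 1, 0                 # (1-1): f = 1, first position
    for i, b in enumerate(ref):
        f = ((f << 1) + b) & MASK64            # (1-2): f = (f << 1) + B_i
        if f & mask == 0:                      # (1-3): lowest p bits all 0
            word = ref[start:i + 1]            # (1-4): a word ends at i
            index.setdefault(fingerprint(word), (start, len(word)))
            start, f = i + 1, 1                # reset f for the next word
    return index                               # (1-5): loop ran to the last byte


rng = random.Random(7)
ref = bytes(rng.randrange(256) for _ in range(4096))
index = build_word_index(ref, p=4)
print(len(index), "indexed words")
```

Keeping only an offset and a length per fingerprint keeps the index small, since the word bytes themselves can always be read back from the reference block.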
(2) perform fast content-based cutting on a data block A that is similar to reference block B, and extend each duplicate word detected during the cutting, to obtain duplicate words and non-duplicate words; this step comprises the following sub-steps:
(2-1) initialize the fast sliding hash value g = 1 and set the current sliding position of data block A to j = 1;
(2-2) compute the fast sliding hash value g of data block A at the current position j as g = (g << 1) + A_j, where A_j denotes the byte of data block A at position j;
(2-3) determine whether the lowest p bits of the hash value g computed in step (2-2) are all 0 (that is, whether g & (2^p - 1) equals 0, with the same value of p as in step (1-3)); if so, go to step (2-4); otherwise go to step (2-7);
(2-4) mark position j as the end of a word W, and compute the fingerprint h of word W with the same fingerprint algorithm as in step (1-4) (xxHash in this embodiment);
(2-5) look up the word dictionary of reference block B by fingerprint h, using the fingerprint index to detect whether W is a duplicate word; if so (h exists in the fingerprint index, and the byte content of the corresponding word V in reference block B also matches word W exactly), go to step (2-6); otherwise mark W as a non-duplicate word, set g = 1, and go to step (2-7);
(2-6) mark W as a duplicate word, and continue comparing the bytes that follow W with the bytes that follow V, stopping at the first differing byte (as shown in Fig. 4). With k the number of identical following bytes thus obtained, increase the length of W by k, set j = j + k and g = 1, and return to step (2-2);
Continuing to compare the bytes that follow duplicate words W and V is the duplicate-word extension method proposed by the invention. The extended region needs no time-consuming content-based cutting, fingerprint calculation or fingerprint index lookup, which saves time.
(2-7) set j = j + 1 and repeat steps (2-2) and (2-3) until the last byte of data block A has been processed.
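Sub-steps (2-1) to (2-7), including duplicate-word extension, can be sketched in Python as below. The sketch rests on several assumptions that are not in the text: 0-based positions, a 64-bit mask, a truncated BLAKE2b digest in place of xxHash, a tuple layout of (kind, start, length, ref_offset), and treating bytes after the last cut point as a final non-duplicate word.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1   # assumed register width


def fingerprint(word: bytes) -> bytes:
    return hashlib.blake2b(word, digest_size=8).digest()   # xxHash stand-in


def build_index(ref: bytes, p: int) -> dict:
    """Step (1): map word fingerprints of `ref` to word start offsets."""
    mask, index, f, start = (1 << p) - 1, {}, 1, 0
    for i, b in enumerate(ref):
        f = ((f << 1) + b) & MASK64
        if f & mask == 0:
            index.setdefault(fingerprint(ref[start:i + 1]), start)
            start, f = i + 1, 1
    return index


def split_target(ref: bytes, target: bytes, p: int) -> list:
    """Step (2): return words covering `target` in cutting order."""
    mask, index = (1 << p) - 1, build_index(ref, p)
    words, g, start, j = [], 1, 0, 0
    while j < len(target):
        g = ((g << 1) + target[j]) & MASK64        # (2-2)
        j += 1
        if g & mask:                               # (2-3) fails: keep sliding
            continue
        word = target[start:j]                     # (2-4): word W ends here
        off = index.get(fingerprint(word))         # (2-5): index lookup
        if off is not None and ref[off:off + len(word)] == word:
            a, b = j, off + len(word)              # (2-6): extend byte by byte
            while a < len(target) and b < len(ref) and target[a] == ref[b]:
                a, b = a + 1, b + 1
            j = a
            words.append(("dup", start, j - start, off))
        else:
            words.append(("new", start, j - start, None))
        start, g = j, 1
    if start < len(target):                        # tail without a cut point
        words.append(("new", start, len(target) - start, None))
    return words


rng = random.Random(1)
ref = bytes(rng.randrange(256) for _ in range(2000))
target = ref[:700] + b"XYZ" + ref[700:]            # similar block: 3 bytes inserted
words = split_target(ref, target, p=4)
rebuilt = b"".join(ref[o:o + n] if kind == "dup" else target[s:s + n]
                   for kind, s, n, o in words)
print(rebuilt == target, sum(1 for w in words if w[0] == "dup"))
```

The explicit byte comparison after the index lookup guards against fingerprint collisions, matching the requirement in (2-5) that the byte content of V and W also agree.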
(3) encode and store the duplicate words and non-duplicate words obtained in step (2) in cutting order, recording duplicate words and non-duplicate words in two different data formats, to obtain the delta data block Δ_{B,A}. Specifically, as shown in Fig. 5, the two data formats are denoted '0' and '1': the '0' data format records the position and length of a duplicate word in reference block B, and the '1' data format records the length and byte content of a non-duplicate word. The result is the delta data block of data block A with respect to reference block B (denoted Δ_{B,A}).
(4) when the delta data block Δ_{B,A} needs to be decoded, read the records of the two data formats from Δ_{B,A} in order to obtain all words of data block A, and write these words sequentially to the output stream to restore the complete data block A. Specifically, as shown in Fig. 5, for a record in the '0' data format, the duplicate word is obtained from reference block B according to its position and length; for a record in the '1' data format, the non-duplicate word is taken directly from the record.
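Steps (3) and (4) can be sketched in Python as a pair of functions that write and read the two record formats. The byte layout used here (a one-byte tag followed by 4-byte little-endian fields) is an assumption made for illustration; the field widths of Fig. 5 are not given in the text.

```python
import struct


def encode(records) -> bytes:
    """Serialise records in cutting order.

    (0, ref_offset, length) is a '0' record for a duplicate word;
    (1, literal_bytes) is a '1' record for a non-duplicate word.
    """
    out = bytearray()
    for rec in records:
        if rec[0] == 0:
            out += struct.pack("<BII", 0, rec[1], rec[2])
        else:
            out += struct.pack("<BI", 1, len(rec[1])) + rec[1]
    return bytes(out)


def decode(delta: bytes, ref: bytes) -> bytes:
    """Rebuild data block A from the delta data and reference block B."""
    out, pos = bytearray(), 0
    while pos < len(delta):
        if delta[pos] == 0:                        # '0': copy from the reference
            _, off, length = struct.unpack_from("<BII", delta, pos)
            pos += 9
            out += ref[off:off + length]
        else:                                      # '1': bytes stored in the record
            _, length = struct.unpack_from("<BI", delta, pos)
            pos += 5
            out += delta[pos:pos + length]
            pos += length
    return bytes(out)


ref = b"the quick brown fox jumps over the lazy dog"
records = [(0, 0, 10), (1, b"red"), (0, 15, len(ref) - 15)]
delta = encode(records)
print(decode(delta, ref))
# → b'the quick red fox jumps over the lazy dog'
```

Decoding needs no hashing or index lookup at all, only copies from the reference block and from the record stream, which is why fewer and longer records make it faster.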
In summary, the invention speeds up delta compression encoding and offers fast duplicate-word lookup, low computational overhead and high data compression efficiency.
Those skilled in the art will readily understand that the above are merely preferred embodiments of the invention and do not limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (4)

1. A fast delta compression method, comprising the following steps:
(1) performing fast content-based cutting on the reference block B used in delta compression to obtain multiple words, which form a word dictionary;
(2) performing fast content-based cutting on a data block A that is similar to reference block B, and extending each duplicate word detected during the cutting, to obtain duplicate words and non-duplicate words;
wherein step (2) comprises the following sub-steps:
(2-1) initializing the fast sliding hash value g = 1 and setting the current sliding position of data block A to j = 1;
(2-2) computing the fast sliding hash value g of data block A at the current position j as g = (g << 1) + A_j, where A_j denotes the byte of data block A at position j;
(2-3) determining whether the lowest p bits of the hash value g computed in step (2-2) are all 0; if so, going to step (2-4); otherwise going to step (2-7);
(2-4) marking position j as the end of a word W, and computing the fingerprint h of word W with a fingerprint algorithm;
(2-5) looking up the word dictionary of reference block B by fingerprint h, using the fingerprint index to detect whether W is a duplicate word; if so, going to step (2-6); otherwise marking W as a non-duplicate word, setting g = 1, and going to step (2-7);
(2-6) marking W as a duplicate word, and continuing to compare the bytes that follow W with the bytes that follow its duplicate V in reference block B, stopping at the first differing byte; with k the number of identical following bytes thus obtained, increasing the length of W by k, setting j = j + k and g = 1, and returning to step (2-2);
(2-7) setting j = j + 1 and repeating steps (2-2) and (2-3) until the last byte of data block A has been processed;
(3) encoding and storing the duplicate words and non-duplicate words obtained in step (2) in cutting order, recording duplicate words and non-duplicate words in two different data formats, to obtain the delta data block Δ_{B,A};
(4) when the delta data block Δ_{B,A} needs to be decoded, reading the records of the two data formats from Δ_{B,A} in order to obtain all words of data block A, and writing these words sequentially to the output stream to restore the complete data block A.
2. The fast delta compression method according to claim 1, characterized in that step (1) comprises the following sub-steps:
(1-1) initializing the fast sliding hash value f = 1 and setting the current sliding position of reference block B to i = 1;
(1-2) computing the fast sliding hash value f of reference block B at the current sliding position i as f = (f << 1) + B_i, where B_i denotes the byte of reference block B at position i;
(1-3) determining whether the lowest p bits of the hash value f computed in step (1-2) are all 0; if so, going to step (1-4); otherwise going to step (1-5);
(1-4) marking position i as the end of a word, computing a fingerprint of the word with a fingerprint algorithm to build the fingerprint index entry for the word, setting f = 1, and going to step (1-5);
(1-5) setting i = i + 1 and repeating steps (1-2) and (1-3) until the last byte of reference block B has been processed.
3. The fast delta compression method according to claim 1, characterized in that step (3) is specifically: using the '0' data format to record the position and length of each duplicate word in reference block B, and using the '1' data format to record the length and byte content of each non-duplicate word, to obtain the delta data block Δ_{B,A} of data block A with respect to reference block B.
4. The fast delta compression method according to claim 1, characterized in that step (4) is specifically: for a record in the '0' data format, obtaining the duplicate word from reference block B according to its position and length; for a record in the '1' data format, taking the non-duplicate word directly from the record.
CN201510927001.5A 2015-12-14 2015-12-14 A fast delta compression method Active CN105515586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510927001.5A CN105515586B (en) 2015-12-14 2015-12-14 A fast delta compression method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510927001.5A CN105515586B (en) 2015-12-14 2015-12-14 A fast delta compression method

Publications (2)

Publication Number Publication Date
CN105515586A CN105515586A (en) 2016-04-20
CN105515586B true CN105515586B (en) 2019-04-12

Family

ID=55723301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510927001.5A Active CN105515586B (en) 2015-12-14 2015-12-14 A fast delta compression method

Country Status (1)

Country Link
CN (1) CN105515586B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844479B (en) * 2016-12-23 2020-07-07 光锐恒宇(北京)科技有限公司 Method and device for compressing and decompressing file
CN108268628A (en) * 2018-01-15 2018-07-10 深圳前海信息技术有限公司 The method and device of delta compression based on dynamic anchor point
CN110083743B (en) * 2019-03-28 2021-11-16 哈尔滨工业大学(深圳) Rapid similar data detection method based on unified sampling
CN111796969A (en) * 2020-05-29 2020-10-20 湖北工业大学 Data difference compression detection method, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102684827A (en) * 2012-03-02 2012-09-19 华为技术有限公司 Data processing method and data processing equipment
CN102831222A (en) * 2012-08-24 2012-12-19 华中科技大学 Differential compression method based on data de-duplication

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130087850A (en) * 2012-01-30 2013-08-07 삼성전자주식회사 System for deduplicating the data and method for the same

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于极值点分块的重复数据检测算法;谢垂益等;《技术研究》;20130831(第8期);正文第
数据备份系统中冗余数据的高性能消除技术研究;夏文;《中国博士学位论文全文数据库(信息科技辑)》;20150715(第7期);正文第96-106页

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant