CN105515586B - A fast delta compression method - Google Patents
Classifications
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3084—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
- H03M7/3091—Data deduplication
Abstract
The invention discloses a fast delta compression method, comprising: performing fast content-based chunking on the reference block B used in delta compression to obtain multiple words, which form a word library; performing fast content-based chunking on a data block A similar to reference block B, and amplifying (extending) each duplicate word detected during chunking, to obtain duplicate words and non-duplicate words; encoding and storing the duplicate and non-duplicate words in chunking order, with duplicate and non-duplicate words recorded in two different data formats, to obtain the delta data block Δ_{B,A}; and, when Δ_{B,A} needs to be decoded, reading the records of the two data formats from Δ_{B,A} in order to obtain all words of data block A in sequence and writing them to the output stream in that order, thereby restoring the complete data block A. The invention offers fast duplicate-word lookup, low computational overhead, and high data compression efficiency.
Description
Technical field
The invention belongs to the field of data compression for computer storage, and more particularly relates to a fast delta compression method.
Background art
In recent years, with the development of computer technology and the spread of networks, the volume of data stored worldwide has grown explosively. Although the price of storage devices keeps falling, it lags far behind the growth of data. Data deduplication, a technique that effectively eliminates redundant data at large scale, has become a hot topic in storage systems research in recent years. Deduplication not only saves storage space and thereby improves the utilization of storage resources, but also avoids transmitting redundant data and thereby improves the efficiency of network bandwidth use.
As deduplication has developed, however, it has also run into many challenges. Because traditional deduplication decides whether data is duplicated by comparing fingerprints of data blocks, it can only identify blocks that are exactly identical and cannot identify blocks that are merely very similar. For example, when two data blocks A1 and A2 differ in only a few bytes, A1 and A2 are almost identical, yet deduplication produces completely different fingerprints for them and therefore leaves the redundancy between these similar blocks untouched. Delta compression was proposed for exactly this situation. Delta compression is an efficient data compression technique that compresses a data block A_i strongly against a similar reference data block (also called base data block) A_r. The more similar the data blocks, the higher the compression efficiency. A_r and A_i are fed to a delta encoder, which outputs a delta (denoted Δ_{r,i}) that represents the compressed version of A_i. To decompress A_i, the delta Δ_{r,i} and the reference data block A_r are read, and A_i is computed from them.
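The limitation of fingerprint-based deduplication described above can be illustrated with a minimal sketch. SHA-1 is used here only as a stand-in fingerprint (the text does not name the fingerprint used by traditional deduplication), and the block contents are made up for the illustration.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    # Stand-in block fingerprint (assumption: SHA-1; any cryptographic hash behaves alike)
    return hashlib.sha1(block).hexdigest()

a1 = bytes(range(256)) * 16        # a 4 KiB block
changed = bytearray(a1)
changed[100] ^= 0xFF               # flip a single byte
a2 = bytes(changed)

same_bytes = sum(x == y for x, y in zip(a1, a2))
print(f"{same_bytes} of {len(a1)} bytes are identical")
print("fingerprints equal:", fingerprint(a1) == fingerprint(a2))
```

All but one byte are shared, yet the two fingerprints differ, so a fingerprint index alone would store both blocks in full. Delta compression targets exactly this remaining redundancy.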
Compared with data deduplication, delta compression can therefore also eliminate redundancy between data that is not identical but similar, and thus achieves a higher data compression ratio.
However, existing delta compression techniques have the following problems: compression encoding is slow, indexing overhead is high, data compression efficiency is low, and scalability is poor. Take Xdelta, the classic delta compression algorithm from the University of California, Berkeley, which is widely used today: its encoding rate is only about 30-60 MB/s (on an Intel quad-core Xeon 2.6 GHz processor). Such a slow encoding rate severely limits the adoption and development of the algorithm.
Summary of the invention
In view of the above defects of the prior art and the need for improvement, the invention provides a fast delta compression method. Its purpose is to identify the differing data between similar data blocks (or files) through fast content-based word chunking, word hashing, and index lookup of duplicate words, and then to store the result in delta-encoded form. This saves storage space and solves the technical problems of existing delta compression techniques: slow compression encoding, high indexing overhead, low data compression efficiency, and poor scalability.
To achieve the above object, according to one aspect of the invention, a fast delta compression method is provided, comprising the following steps:
(1) Perform fast content-based chunking on the reference block B used in delta compression to obtain multiple words, which form a word library;
(2) Perform fast content-based chunking on a data block A similar to reference block B, and amplify the duplicate words detected during chunking, to obtain duplicate words and non-duplicate words;
(3) Encode and store the duplicate and non-duplicate words obtained in step (2) in chunking order, recording duplicate and non-duplicate words in two different data formats, to obtain the delta data block Δ_{B,A};
(4) When Δ_{B,A} needs to be decoded, read the records of the two data formats from Δ_{B,A} in order to obtain all words of data block A in sequence, and write these words to the output stream in that order to restore the complete data block A.
Preferably, step (1) comprises the following sub-steps:
(1-1) Initialize the fast rolling hash value f = 1 and set the current sliding position of reference block B to i = 1;
(1-2) Compute the fast rolling hash value f of reference block B at the current sliding position i: f = (f << 1) + B_i, where B_i denotes the byte of reference block B at sliding position i;
(1-3) Determine whether the lowest p bits of the hash value f computed in step (1-2) are all 0; if so, go to step (1-4), otherwise go to step (1-5);
(1-4) Mark position i as the end of a word, compute a fingerprint of the word with a fingerprint algorithm to build the fingerprint index entry for the word, set f = 1, and go to step (1-5);
(1-5) Set i = i + 1 and repeat steps (1-2) and (1-3) until the last byte of reference block B has been processed.
Preferably, step (2) comprises the following sub-steps:
(2-1) Initialize the fast rolling hash value g = 1 and set the current sliding position of data block A to j = 1;
(2-2) Compute the fast rolling hash value g of data block A at the current position j: g = (g << 1) + A_j, where A_j denotes the byte of data block A at sliding position j;
(2-3) Determine whether the lowest p bits of the hash value g computed in step (2-2) are all 0; if so, go to step (2-4), otherwise go to step (2-7);
(2-4) Mark position j as the end of a word W, and compute the fingerprint h of word W with the same fingerprint algorithm as in step (1-4);
(2-5) Look up fingerprint h in the word library of reference block B and use the fingerprint index to detect whether W is a duplicate word; if so, go to step (2-6); otherwise mark word W as a non-duplicate word, set g = 1, and go to step (2-7);
(2-6) Mark word W as a duplicate word, then keep comparing the bytes that follow word W with the bytes that follow its duplicate word V in reference block B, stopping at the first differing byte; this yields the number k of identical subsequent bytes. Increase the length of word W by k, set j = j + k and g = 1, and return to step (2-2);
(2-7) Set j = j + 1 and repeat steps (2-2) and (2-3) until the last byte of data block A has been processed.
Preferably, step (3) specifically uses the '0' data format to record the position and length of a duplicate word in reference block B, and the '1' data format to record the length and byte content of a non-duplicate word, to obtain the delta data block Δ_{B,A} of data block A relative to reference block B.
Preferably, step (4) specifically obtains, for a record in the '0' data format, the duplicate word from reference block B according to the position and length information, and, for a record in the '1' data format, takes the non-duplicate word directly from the record.
In general, compared with the prior art, the above technical solution conceived by the invention achieves the following beneficial effects:
1. With content-based fast word chunking (steps (1-2), (1-3), (2-2) and (2-3)), each slide requires only one left shift and one addition, which makes word chunking fast. It also avoids the large number of overlapping words generated by conventional compression encoding, simplifies word matching during delta encoding, reduces the corresponding memory and computational overhead, and still ensures high delta compression efficiency.
2. With the duplicate-word amplification strategy, the content following a detected duplicate word is compared directly to see whether it is also duplicated. This avoids chunking, hashing, and index lookup for the bytes that follow a duplicate word and speeds up delta encoding. Because amplification reduces the number of records in the delta data, it also speeds up decoding. In addition, the method preserves the locality of duplicate words, which ensures high data compression efficiency.
3. Content-based fast word chunking helps the amplification method locate duplicate words quickly, and amplification in turn spares delta compression the chunking and hashing of the subsequent content. The two complement each other and yield a high delta encoding speed. Amplification also greedily detects as much duplicate content following a duplicate word as possible, which ensures high delta compression efficiency.
4. The invention merges non-duplicate words for encoding: multiple consecutive non-duplicate words are stored in a single record, which speeds up both delta encoding and decoding.
Brief description of the drawings
Fig. 1 is a flowchart of the fast delta compression encoding method of the invention.
Fig. 2 is a schematic diagram of the fast rolling hash method used by the invention.
Fig. 3 is a schematic comparison of the content-based chunking used by the invention with a conventional compression method.
Fig. 4 is a schematic diagram of the word amplification method used by the invention.
Fig. 5 is a schematic diagram of the encoding format used by the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features of the embodiments described below may be combined with one another as long as they do not conflict.
As shown in Fig. 1, the delta compression of the invention consists of three main parts: content-based fast word chunking, duplicate-word amplification, and delta encoding with merged non-duplicate words.
The content-based fast word chunking proposed by the invention avoids the large number of overlapping words generated by conventional compression encoding and simplifies word matching during delta encoding. As shown in Fig. 2, traditional delta compression methods generate a large number of overlapping strings in the hope of finding more duplicate words. The content-based fast word chunking of the invention generates fewer words and also avoids the position-shift problem caused by content modifications, because identical content is guaranteed to produce identical word cut points. The advantages of this method are that each slide requires only one addition and one left shift, memory and computational overhead are low, and high delta compression efficiency is still ensured.
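The claim that identical content yields identical cut points can be checked with a minimal sketch of the hash update f = (f << 1) + B_i. The 64-bit register width and the function names `roll` and `hash_after` are assumptions made for this illustration; the text does not specify them. Because every older byte is shifted left once per step, the lowest p bits of f depend only on the last p bytes, whatever came before.

```python
MASK64 = (1 << 64) - 1   # assumed 64-bit register; the width is not stated in the text

def roll(f: int, byte: int) -> int:
    """One sliding step: one left shift and one addition."""
    return ((f << 1) + byte) & MASK64

def hash_after(data: bytes) -> int:
    """Rolling hash value after sliding over all of data, starting from f = 1."""
    f = 1
    for b in data:
        f = roll(f, b)
    return f

p = 6
low_mask = (1 << p) - 1
tail = b"abcdef"                     # the same p trailing bytes in both inputs
h1 = hash_after(b"some prefix " + tail)
h2 = hash_after(b"a completely different, longer prefix " + tail)
print((h1 & low_mask) == (h2 & low_mask))
```

The two full hash values differ, but their lowest p bits agree, so the cut decision made on those bits is the same at the end of both inputs. This is why an insertion or deletion earlier in a block shifts the cut points only locally.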
The duplicate-word amplification method proposed by the invention matches content directly after each detected duplicate word, so the amplified region needs no chunking, fingerprint calculation, or index lookup. The method builds on the word locality of data streams. In a storage system, word locality means that once words have appeared in the order A, B, C, the next time word A appears, words B and C are very likely to follow it. The invention exploits this locality for delta encoding. As shown in Fig. 4, consider the word sequences of two similar data blocks, B_1, B_2, B_3, B_4, B_5 and A_1, A_2, A_3, A_4, A_5, and suppose content-based fast word chunking has determined that B_2 and A_2 are duplicates. By the locality principle, words B_3, B_4, B_5 are very likely duplicates of A_3, A_4, A_5 respectively, so the time-consuming chunking, hashing, and indexing of B_3, B_4, B_5 and A_3, A_4, A_5 become unnecessary: it suffices to compare byte by byte onward from B_2 and A_2, extending the match to {B_2, B_3, B_4, B_5} and {A_2, A_3, A_4, A_5}. Moreover, amplification is not bound to the content-defined word boundaries; it may stop at a position inside some word, which allows more duplicate content to be found. Because amplification only compares the subsequent content, it simplifies and speeds up delta encoding while maintaining high data compression efficiency.
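The amplification described above amounts to a plain byte-by-byte comparison. The following is a minimal sketch; the function name `extend_match` and the sample blocks are hypothetical, and positions are 0-based.

```python
def extend_match(a: bytes, a_end: int, b: bytes, b_end: int) -> int:
    """Return k, the number of identical bytes that follow a[:a_end] and b[:b_end]."""
    k = 0
    while (a_end + k < len(a) and b_end + k < len(b)
           and a[a_end + k] == b[b_end + k]):
        k += 1
    return k

b_block = b"HEADER|word-2|word-3|word-4|word-5|tail-B"
a_block = b"header|word-2|word-3|word-4|word-5|TAIL-A"

# Suppose chunking and the fingerprint index found that the 7-byte word
# "word-2|" at offset 7 is a duplicate in both blocks.
start, length = 7, 7
k = extend_match(a_block, start + length, b_block, start + length)
print(k, a_block[start:start + length + k])
```

Here the single indexed match grows by 21 bytes to cover the three following words without hashing or looking up any of them, and the comparison stops at the first differing byte.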
The delta encoding with merged non-duplicate words proposed by the invention is shown in Fig. 5: consecutive non-duplicate words are merged and encoded as a single word, which improves delta compression efficiency and also speeds up decoding. Combined, the techniques above significantly speed up both the encoding and decoding of delta compression while maintaining data compression efficiency.
As shown in Fig. 1, the fast delta compression method of the invention performs fast delta encoding of a given data block (or file) A against a similar reference data block B, and comprises the following specific steps:
(1) Perform fast content-based chunking (FastRolling) on the reference block B used in delta compression to obtain multiple words, which form a word library. This step comprises the following sub-steps:
(1-1) Initialize the fast rolling hash value f = 1 and set the current sliding position of reference block B to i = 1;
(1-2) Compute the fast rolling hash value f of reference block B at the current sliding position i: f = (f << 1) + B_i, where B_i denotes the byte of reference block B at sliding position i, as shown in Fig. 2;
With one left shift and one addition, this hash algorithm provides the rolling-hash property: the current hash value is obtained from the previous one, old bytes are gradually removed from the hash value by the left shifts, and the new byte enters the hash value through the addition. It is also fast to compute and produces hash values with good randomness.
(1-3) Determine whether the lowest p bits of the hash value f computed in step (1-2) are all 0, i.e., whether f & (2^p - 1) equals 0. The value of p is chosen so that the average length of the words obtained by chunking reference block B is approximately 2^p bytes; for example, for an average word length of about 64 bytes, p = 6. In general 4 ≤ p ≤ 13, i.e., the average word length is kept between 16 bytes and 8192 bytes. If the condition holds, go to step (1-4); otherwise go to step (1-5);
The content-based fast word chunking used here guarantees that identical content produces identical word cut points. The effect of delta compression can be tuned through the average word length (by setting p to different values). In general, the larger the average word length, the coarser the granularity of delta compression, the faster the computation, and the lower the compression ratio; the smaller the average word length, the finer the granularity, the slower the computation, and the higher the compression ratio.
As shown in Fig. 3, content-based fast word chunking produces only a small number of words for duplicate-word lookup, avoids the large number of overlapping words generated by conventional compression encoding, and simplifies word matching during delta encoding.
(1-4) Mark position i as the end of a word, compute a fingerprint of the word with a fingerprint algorithm (xxHash in this embodiment) to build the fingerprint index entry for the word, set f = 1, and go to step (1-5);
(1-5) Set i = i + 1 and repeat steps (1-2) and (1-3) until the last byte of reference block B has been processed.
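Sub-steps (1-1) to (1-5) can be sketched as follows. Several details are assumptions made for this illustration: positions are 0-based rather than starting at 1, the hash register is 64 bits wide, bytes left over after the last cut point form a final word, and BLAKE2b from the Python standard library stands in for xxHash, which is not part of the standard library. The names `cut_words`, `fingerprint`, and `build_index` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

MASK64 = (1 << 64) - 1   # assumed register width

def cut_words(block: bytes, p: int) -> List[Tuple[int, int]]:
    """Return the (start, end) byte ranges of the words in block, end exclusive."""
    mask = (1 << p) - 1
    words, start, f = [], 0, 1                 # (1-1): f = 1
    for i, byte in enumerate(block):
        f = ((f << 1) + byte) & MASK64         # (1-2): f = (f << 1) + B_i
        if f & mask == 0:                      # (1-3): lowest p bits all zero
            words.append((start, i + 1))       # (1-4): a word ends at position i
            start, f = i + 1, 1                #        reset f = 1
    if start < len(block):
        words.append((start, len(block)))      # leftover bytes (assumption)
    return words

def fingerprint(word: bytes) -> bytes:
    # Stand-in for xxHash: a 64-bit BLAKE2b digest
    return hashlib.blake2b(word, digest_size=8).digest()

def build_index(ref: bytes, p: int) -> Dict[bytes, Tuple[int, int]]:
    """Word library: fingerprint -> (offset, length) of the first such word in ref."""
    index: Dict[bytes, Tuple[int, int]] = {}
    for start, end in cut_words(ref, p):
        index.setdefault(fingerprint(ref[start:end]), (start, end - start))
    return index

ref = bytes((i * 37 + 11) % 251 for i in range(4096))   # made-up sample data
words = cut_words(ref, p=6)
index = build_index(ref, p=6)
print(len(words), "words,", len(index), "index entries")
```

The words partition the block without gaps or overlap, which is what keeps the index small compared with indexing a string at every byte position.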
(2) Perform fast content-based chunking on a data block A similar to reference block B, and amplify the duplicate words detected during chunking, to obtain duplicate words and non-duplicate words. This step comprises the following sub-steps:
(2-1) Initialize the fast rolling hash value g = 1 and set the current sliding position of data block A to j = 1;
(2-2) Compute the fast rolling hash value g of data block A at the current position j: g = (g << 1) + A_j, where A_j denotes the byte of data block A at sliding position j;
(2-3) Determine whether the lowest p bits of the hash value g computed in step (2-2) are all 0, i.e., whether g & (2^p - 1) equals 0, where p has the same value as in step (1-3). If so, go to step (2-4); otherwise go to step (2-7);
(2-4) Mark position j as the end of a word W, and compute the fingerprint h of word W with the same fingerprint algorithm as in step (1-4) (xxHash in this embodiment);
(2-5) Look up fingerprint h in the word library of reference block B and use the fingerprint index to detect whether W is a duplicate word. If so (h exists in the fingerprint index and the byte content of the corresponding word V in reference block B exactly matches that of word W), go to step (2-6); otherwise mark word W as a non-duplicate word, set g = 1, and go to step (2-7);
(2-6) Mark word W as a duplicate word, then keep comparing the bytes that follow word W with the bytes that follow word V, stopping at the first differing byte (as shown in Fig. 4). This yields the number k of identical subsequent bytes. Increase the length of word W by k, set j = j + k and g = 1, and return to step (2-2);
Continuing to compare the bytes that follow the duplicate words W and V is the duplicate-word amplification method proposed by the invention. The amplified region needs none of the time-consuming content-based chunking, fingerprint calculation, or fingerprint index operations, which saves time.
(2-7) Set j = j + 1 and repeat steps (2-2) and (2-3) until the last byte of data block A has been processed.
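Sub-steps (2-1) to (2-7) can be sketched together with a compact version of step (1). As before, several details are assumptions for illustration only: 0-based positions, a 64-bit hash register, BLAKE2b as a stand-in for xxHash, leftover bytes treated as a final non-duplicate word, and the hypothetical names `index_reference` and `split_words`.

```python
import hashlib
from typing import Dict, List

MASK64 = (1 << 64) - 1   # assumed register width

def fingerprint(word: bytes) -> bytes:
    # Stand-in for xxHash: a 64-bit BLAKE2b digest
    return hashlib.blake2b(word, digest_size=8).digest()

def index_reference(ref: bytes, p: int) -> Dict[bytes, int]:
    """Step (1): map each word fingerprint to the word's start offset in ref."""
    index: Dict[bytes, int] = {}
    start, f, mask = 0, 1, (1 << p) - 1
    for i, byte in enumerate(ref):
        f = ((f << 1) + byte) & MASK64
        if f & mask == 0:
            index.setdefault(fingerprint(ref[start:i + 1]), start)
            start, f = i + 1, 1
    return index

def split_words(data: bytes, ref: bytes, index: Dict[bytes, int], p: int) -> List[tuple]:
    """Step (2): return ("dup", offset_in_ref, length) and ("new", bytes) records."""
    out: List[tuple] = []
    mask = (1 << p) - 1
    start, j, g = 0, 0, 1                        # (2-1)
    while j < len(data):
        g = ((g << 1) + data[j]) & MASK64        # (2-2)
        j += 1                                   # j now points past the hashed byte
        if g & mask != 0:                        # (2-3) no cut point: keep sliding (2-7)
            continue
        word = data[start:j]                     # (2-4) word W ends here
        off = index.get(fingerprint(word))       # (2-5) look W up in the word library
        if off is not None and ref[off:off + len(word)] == word:
            k, r = 0, off + len(word)            # (2-6) amplify past W and V
            while j + k < len(data) and r + k < len(ref) and data[j + k] == ref[r + k]:
                k += 1
            out.append(("dup", off, len(word) + k))
            j += k
        else:
            out.append(("new", word))
        start, g = j, 1
    if start < len(data):
        out.append(("new", data[start:]))        # leftover bytes (assumption)
    return out

ref = b"fast delta" + bytes(8) + b" compression of similar blocks"
index = index_reference(ref, p=4)
identical = split_words(ref, ref, index, p=4)
edited = split_words(ref[:-1] + b"!", ref, index, p=4)
print(identical)
print(edited)
```

When the data block equals the reference block, the first word is found in the index and amplification carries the match to the end of the block, so the whole block becomes one duplicate record. Changing only the last byte yields one long duplicate record followed by a one-byte non-duplicate word.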
(3) Encode and store the duplicate and non-duplicate words obtained in step (2) in chunking order, recording duplicate and non-duplicate words in two different data formats, to obtain the delta data block Δ_{B,A}. Specifically, as shown in Fig. 5, the two data formats are denoted '0' and '1': the '0' data format records the position and length of a duplicate word in reference block B, and the '1' data format records the length and byte content of a non-duplicate word. The result is the delta data block of data block A relative to reference block B (denoted Δ_{B,A}).
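The two record formats can be sketched as a simple byte layout. The exact layout of Fig. 5 is not given in the text, so the layout here is purely an assumption: a one-byte tag (0 or 1) followed by little-endian 32-bit fields. The function name `encode_records` and the tuple form of the input are hypothetical as well.

```python
import struct

def encode_records(records) -> bytes:
    """Serialize ("dup", offset, length) and ("new", payload) records.

    Assumed layout (not taken from Fig. 5):
      '0' record: tag byte 0, uint32 offset in B, uint32 length
      '1' record: tag byte 1, uint32 length, then the raw bytes
    """
    out = bytearray()
    for rec in records:
        if rec[0] == "dup":
            _, offset, length = rec
            out += struct.pack("<BII", 0, offset, length)
        else:
            _, payload = rec
            out += struct.pack("<BI", 1, len(payload)) + payload
    return bytes(out)

delta = encode_records([("dup", 0, 40), ("new", b"xyz"), ("dup", 64, 100)])
print(len(delta), "bytes describe", 40 + 3 + 100, "bytes of data block A")
```

A duplicate record has a fixed size regardless of how many bytes it covers, which is why longer amplified matches and fewer records translate directly into a smaller delta.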
(4) When the delta data block Δ_{B,A} needs to be decoded, read the records of the two data formats from Δ_{B,A} in order to obtain all words of data block A in sequence, and write these words to the output stream in that order to restore the complete data block A. Specifically, as shown in Fig. 5, for a record in the '0' data format, the duplicate word is obtained from reference block B according to the position and length information; for a record in the '1' data format, the non-duplicate word is taken directly from the record.
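Decoding can be sketched as a single pass over the records. The byte layout is an assumption for illustration, since the layout of Fig. 5 is not given in the text: a '0' record is a tag byte 0 followed by a little-endian uint32 offset and uint32 length, and a '1' record is a tag byte 1 followed by a uint32 length and the raw bytes. The function name `decode` is hypothetical.

```python
import struct

def decode(delta: bytes, ref: bytes) -> bytes:
    """Rebuild data block A from the delta and reference block B."""
    out, pos = bytearray(), 0
    while pos < len(delta):
        tag = delta[pos]
        if tag == 0:      # '0' record: copy a duplicate word out of the reference block
            offset, length = struct.unpack_from("<II", delta, pos + 1)
            out += ref[offset:offset + length]
            pos += 9
        elif tag == 1:    # '1' record: the non-duplicate word is stored inline
            (length,) = struct.unpack_from("<I", delta, pos + 1)
            out += delta[pos + 5:pos + 5 + length]
            pos += 5 + length
        else:
            raise ValueError(f"unknown record tag {tag}")
    return bytes(out)

ref = b"The quick brown fox jumps over the lazy dog"
delta = (struct.pack("<BII", 0, 0, 10)          # copy ref[0:10]
         + struct.pack("<BI", 1, 3) + b"red"    # inline word "red"
         + struct.pack("<BII", 0, 15, 28))      # copy ref[15:43]
restored = decode(delta, ref)
print(restored)
```

Decoding needs no hashing, chunking, or index lookup at all, only slice copies, so its cost is driven mainly by the number of records.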
In summary, the invention speeds up delta encoding and offers fast duplicate-word lookup, low computational overhead, and high data compression efficiency.
Those skilled in the art will readily understand that the above is only a preferred embodiment of the invention and does not limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (4)
1. A fast delta compression method, comprising the following steps:
(1) performing fast content-based chunking on the reference block B used in delta compression to obtain multiple words, which form a word library;
(2) performing fast content-based chunking on a data block A similar to reference block B, and amplifying the duplicate words detected during chunking, to obtain duplicate words and non-duplicate words;
wherein step (2) comprises the following sub-steps:
(2-1) initializing the fast rolling hash value g = 1 and setting the current sliding position of data block A to j = 1;
(2-2) computing the fast rolling hash value g of data block A at the current position j: g = (g << 1) + A_j, where A_j denotes the byte of data block A at sliding position j;
(2-3) determining whether the lowest p bits of the hash value g computed in step (2-2) are all 0; if so, going to step (2-4), otherwise going to step (2-7);
(2-4) marking position j as the end of a word W, and computing the fingerprint h of word W with a fingerprint algorithm;
(2-5) looking up fingerprint h in the word library of reference block B and using the fingerprint index to detect whether W is a duplicate word; if so, going to step (2-6); otherwise marking word W as a non-duplicate word, setting g = 1, and going to step (2-7);
(2-6) marking word W as a duplicate word, then continuing to compare the bytes that follow word W with the bytes that follow its duplicate word V in reference block B, stopping at the first differing byte, to obtain the number k of identical subsequent bytes; increasing the length of word W by k, setting j = j + k and g = 1, and returning to step (2-2);
(2-7) setting j = j + 1 and repeating steps (2-2) and (2-3) until the last byte of data block A has been processed;
(3) encoding and storing the duplicate and non-duplicate words obtained in step (2) in chunking order, recording duplicate and non-duplicate words in two different data formats, to obtain the delta data block Δ_{B,A};
(4) when the delta data block Δ_{B,A} needs to be decoded, reading the records of the two data formats from Δ_{B,A} in order to obtain all words of data block A in sequence, and writing these words to the output stream in that order to restore the complete data block A.
2. The fast delta compression method according to claim 1, characterized in that step (1) comprises the following sub-steps:
(1-1) initializing the fast rolling hash value f = 1 and setting the current sliding position of reference block B to i = 1;
(1-2) computing the fast rolling hash value f of reference block B at the current sliding position i: f = (f << 1) + B_i, where B_i denotes the byte of reference block B at sliding position i;
(1-3) determining whether the lowest p bits of the hash value f computed in step (1-2) are all 0; if so, going to step (1-4), otherwise going to step (1-5);
(1-4) marking position i as the end of a word, computing a fingerprint of the word with a fingerprint algorithm to build the fingerprint index entry for the word, setting f = 1, and going to step (1-5);
(1-5) setting i = i + 1 and repeating steps (1-2) and (1-3) until the last byte of reference block B has been processed.
3. The fast delta compression method according to claim 1, characterized in that step (3) specifically uses the '0' data format to record the position and length of a duplicate word in reference block B, and the '1' data format to record the length and byte content of a non-duplicate word, to obtain the delta data block Δ_{B,A} of data block A relative to reference block B.
4. The fast delta compression method according to claim 1, characterized in that step (4) specifically obtains, for a record in the '0' data format, the duplicate word from reference block B according to the position and length information, and, for a record in the '1' data format, takes the non-duplicate word directly from the record.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510927001.5A CN105515586B (en) | 2015-12-14 | 2015-12-14 | A fast delta compression method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105515586A CN105515586A (en) | 2016-04-20 |
CN105515586B true CN105515586B (en) | 2019-04-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||