CN108268628A - The method and device of delta compression based on dynamic anchor point - Google Patents

Info

Publication number
CN108268628A
Authority
CN
China
Prior art keywords
anchor point
target
data stream
offset
paragraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810035223.XA
Other languages
Chinese (zh)
Inventor
张宇弘
王界兵
张伟
董迪马
耿涛
黄嘉乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Frontsurf Information Technology Co Ltd
Original Assignee
Shenzhen Frontsurf Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Frontsurf Information Technology Co Ltd filed Critical Shenzhen Frontsurf Information Technology Co Ltd
Priority to CN201810035223.XA priority Critical patent/CN108268628A/en
Publication of CN108268628A publication Critical patent/CN108268628A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a method and device for delta compression based on dynamic anchor points. The method comprises the following steps: scanning a target data stream and a reference data stream with a rolling hash algorithm, and marking a target candidate anchor point and a reference candidate anchor point that have the same rolling hash value as an anchor point pair; using the anchor point pairs to divide the reference data stream and the target data stream into multiple paragraphs; for an unchanged paragraph, recording the extent of the paragraph and encoding it; for a changed paragraph, when the target data stream and the reference data stream flow into a matching module for string matching, automatically aligning the target window or the reference window according to the detected anchor point pair; encoding the result of the string matching; and outputting the encoded data. The technical solution simplifies computation and improves operating efficiency. It can also save a large amount of on-chip memory, which makes a hardware implementation feasible.

Description

Method and device for delta compression based on dynamic anchor points
Technical field
The present invention relates to the technical field of data processing, and more particularly to a method and device for delta compression based on dynamic anchor points.
Background
At present, most compression techniques process a single data stream. Delta compression is different: it compresses data by computing the delta between a target data stream and a reference data stream. The delta can be regarded as an encoding of the difference between the target and reference data streams, so the target data stream can be restored from the delta and the reference data stream.
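The idea that a target can be restored from a delta plus a reference can be illustrated with a minimal sketch. The operation names `copy` and `insert` and the function `apply_delta` are assumptions made for this illustration only; they are not the encoding defined later in this patent.

```python
def apply_delta(reference: bytes, delta: list) -> bytes:
    """Rebuild the target from a reference and a list of delta operations.

    Hypothetical operation format (illustration only):
      ("copy", offset, length)  reuse bytes that already exist in the reference
      ("insert", data)          bytes that are absent from the reference
    """
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += reference[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


reference = b"the quick brown fox"
# Only the four new bytes "red " are carried in the delta itself.
delta = [("copy", 0, 10), ("insert", b"red "), ("copy", 16, 3)]
target = apply_delta(reference, delta)
```

The delta carries only four literal bytes; everything else is expressed as references into the reference data.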
Delta compression was first applied in version control systems. Storing the deltas between versions instead of the actual data significantly reduces storage requirements. For example, MacDonald's Xdelta File System (XDFS) is built on delta compression. Another application is software distribution, especially over the Internet, where distributing deltas or patches greatly reduces network traffic. Delta compression can also improve HTTP performance: by exploiting the similarity between different pages of a given website, or between different versions of a given page, it reduces web access latency. VCDIFF, which is defined in an RFC, supports this usage. However, because of deletions or insertions, the reference data often cannot be aligned with the target data. If the misalignment between the reference data and the target data is too large, the incoming target data cannot find matching strings in the reference data window, and the compression ratio drops sharply. None of the commonly used delta compressors, including xdelta, vdelta (and its newer form, VCDIFF) and zdelta, avoids this problem.
In view of this, it is necessary to further improve current delta compression technology.
Summary of the invention
To solve at least one of the above technical problems, the main object of the present invention is to provide a method for delta compression based on dynamic anchor points.
To achieve the above object, one technical solution adopted by the present invention is to provide a method for delta compression based on dynamic anchor points, comprising:
Scanning the target data stream and the reference data stream with a rolling hash algorithm, and marking a target anchor point and a reference anchor point that have the same rolling hash value as an anchor point pair, wherein the anchor point pair is expressed as (offset relative to the previous target anchor point, offset relative to the previous reference anchor point);
Using the anchor point pairs to divide the target data stream and the reference data stream into multiple paragraphs;
For an unchanged paragraph, recording the extent of the paragraph and encoding it;
For a changed paragraph, when the target data stream and the reference data stream flow into a matching module for string matching, automatically aligning the target window or the reference window according to the detected anchor point pair, wherein the target window can hold part of the data of the target data stream and the reference window can hold part of the data of the reference data stream;
Encoding the result of the string matching; and outputting the encoded data.
Wherein, the step of scanning the target data stream and the reference data stream with a rolling hash algorithm and marking a target anchor point and a reference anchor point that have the same rolling hash value as an anchor point pair specifically comprises:
Scanning the target data stream with the rolling hash algorithm to obtain a series of rolling hash values, selecting several bits of each rolling hash value with a bitmask and comparing them with a preset hash feature value, and, if they are equal, recording a target candidate anchor point;
Scanning the reference data stream with the rolling hash algorithm to obtain a series of rolling hash values, selecting several bits of each rolling hash value with a bitmask and comparing them with the preset hash feature value, and, if they are equal, recording a reference candidate anchor point;
Comparing whether the rolling hash values of a reference candidate anchor point and a target candidate anchor point are the same, and, when they are, determining the reference candidate anchor point to be a reference anchor point and the target candidate anchor point to be a target anchor point;
Marking the reference anchor point and the target anchor point as an anchor point pair.
Wherein, the step of automatically aligning the target window or the reference window according to the detected anchor point pair further comprises:
Determining the order in which the reference anchor point of the reference data stream and the target anchor point of the target data stream are detected;
If the reference anchor point is detected before the target anchor point, that is, the offset of the reference anchor point is less than the offset of the target anchor point, pausing the flow of the reference data stream into the reference window and continuing to search the target data stream flowing into the target window until the paired target anchor point is found, and then resuming string matching between the target data stream and the reference data stream; if the reference anchor point is detected after the target anchor point, that is, the offset of the reference anchor point is greater than the offset of the target anchor point, pausing the flow of the target data stream into the target window and continuing to search the reference data stream flowing into the reference window until the paired reference anchor point is found, and then resuming string matching between the target data stream and the reference data stream.
Wherein, the step in which the target data stream and the reference data stream flow into the matching module for string matching specifically comprises:
Matching the target data stream against the reference data that has flowed into the reference window as far as possible (maximal matching) according to the LZ77 algorithm to obtain an output result, the output result consisting of match units [offset, match length] and character units.
Wherein, the output encoded data comprises the output encoded data of unchanged paragraphs and the output encoded data of changed paragraphs;
The output encoded data of an unchanged paragraph consists of a reference anchor point and a paragraph length;
The output encoded data of a changed paragraph consists of a reference anchor point, match units and character units; wherein the reference anchor point is recorded as an offset relative to the previous reference anchor point, and the offset of a match unit is an offset relative to the current reference anchor point.
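The two record shapes described above can be made concrete with a small decoder sketch. This is an illustration under several assumptions that the text does not settle: the tags `"same"` and `"diff"`, the token names `"match"` and `"lit"`, the function name `decode`, the choice that a paragraph begins at its reference anchor point, and the choice that match offsets count forward from the current reference anchor point.

```python
def decode(reference: bytes, records: list) -> bytes:
    """Rebuild the target from the reference and a list of paragraph records.

    Hypothetical record shapes (illustration only):
      ("same", anchor_delta, length)   unchanged paragraph
      ("diff", anchor_delta, tokens)   changed paragraph, where each token is
                                       ("match", offset, length) or ("lit", data)
    anchor_delta is the offset relative to the previous reference anchor point.
    """
    out = bytearray()
    anchor = 0  # absolute position of the current reference anchor point
    for kind, anchor_delta, body in records:
        anchor += anchor_delta
        if kind == "same":
            out += reference[anchor:anchor + body]
            continue
        for token in body:
            if token[0] == "match":
                _, offset, length = token
                start = anchor + offset  # offset is relative to the current anchor
                out += reference[start:start + length]
            else:
                out += token[1]
    return bytes(out)


reference = b"AAAA|BBBB|CCCC"
records = [
    ("same", 0, 5),
    ("diff", 5, [("match", 0, 2), ("lit", b"xx"), ("match", 2, 3)]),
    ("same", 5, 4),
]
target = decode(reference, records)
```

Because every match offset is relative to the current reference anchor point, the offsets stay small even when the streams are long, which is consistent with the small reference window the text describes.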
To achieve the above object, another technical solution adopted by the present invention is to provide a device for delta compression based on dynamic anchor points, comprising:
A determining module, for scanning the target data stream and the reference data stream with a rolling hash algorithm and marking a target anchor point and a reference anchor point that have the same rolling hash value as an anchor point pair, wherein the anchor point pair is expressed as (offset relative to the previous target anchor point, offset relative to the previous reference anchor point);
A paragraph division module, for using the anchor point pairs to divide the target data stream and the reference data stream into multiple paragraphs;
A first processing module, for processing unchanged paragraphs, including recording the extent of each paragraph and encoding it;
A second processing module, for processing changed paragraphs, including, when the target data stream and the reference data stream flow into a matching module for string matching, automatically aligning the target window or the reference window according to the detected anchor point pair, wherein the target window can hold part of the data of the target data stream and the reference window can hold part of the data of the reference data stream;
A coding module, for encoding the result of the string matching; and
An output module, for outputting the encoded data.
Wherein, the determining module is specifically used for:
Scanning the target data stream with the rolling hash algorithm to obtain a series of rolling hash values, selecting several bits of each rolling hash value with a bitmask and comparing them with a preset hash feature value, and, if they are equal, recording a target candidate anchor point;
Scanning the reference data stream with the rolling hash algorithm to obtain a series of rolling hash values, selecting several bits of each rolling hash value with a bitmask and comparing them with the preset hash feature value, and, if they are equal, recording a reference candidate anchor point;
Comparing whether the rolling hash values of a reference candidate anchor point and a target candidate anchor point are the same, and, when they are, determining the reference candidate anchor point to be a reference anchor point and the target candidate anchor point to be a target anchor point;
Marking the reference anchor point and the target anchor point as an anchor point pair.
Wherein, the second processing module is specifically used for:
Determining the order in which the reference anchor point of the reference data stream and the target anchor point of the target data stream are detected;
If the reference anchor point is detected before the target anchor point, that is, the offset of the reference anchor point is less than the offset of the target anchor point, pausing the flow of the reference data stream into the reference window and continuing to search the target data stream flowing into the target window until the paired target anchor point is found, and then resuming string matching between the target data stream and the reference data stream; if the reference anchor point is detected after the target anchor point, that is, the offset of the reference anchor point is greater than the offset of the target anchor point, pausing the flow of the target data stream into the target window and continuing to search the reference data stream flowing into the reference window until the paired reference anchor point is found, and then resuming string matching between the target data stream and the reference data stream.
Wherein, the second processing module is also used for:
Matching the target data stream against the reference data that has flowed into the reference window as far as possible (maximal matching) according to the LZ77 algorithm to obtain an output result, the output result consisting of match units [offset, match length] and character units.
Wherein, the output encoded data comprises the output encoded data of unchanged paragraphs and the output encoded data of changed paragraphs;
The output encoded data of an unchanged paragraph consists of a reference anchor point and a paragraph length;
The output encoded data of a changed paragraph consists of a reference anchor point, match units and character units; wherein the reference anchor point is recorded as an offset relative to the previous reference anchor point, and the offset of a match unit is an offset relative to the current reference anchor point.
In the technical solution of the present invention, the target data stream and the reference data stream are first scanned with a rolling hash algorithm, and each matched target anchor point and reference anchor point are marked as an anchor point pair. The anchor point pairs divide the target data stream and the reference data stream into multiple paragraphs. For an unchanged paragraph, the extent of the paragraph is recorded and encoded. For a changed paragraph, while the target data stream and the reference data stream are string-matched in streaming fashion, the target window or the reference window is automatically aligned according to whichever anchor point of the pair is detected, and the result of the string matching is then encoded. Finally the encoded data is output. By using a smaller reference window than other tools and by detecting the target data stream and the reference data stream through the dynamic anchor points it sets, the solution reduces computational complexity and improves system performance. In addition, the smaller reference window saves a large amount of on-chip memory, which makes a hardware implementation feasible.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from the structures shown in these drawings without creative effort.
Fig. 1 is a schematic diagram of text being inserted into target data;
Fig. 2 is a schematic diagram of existing data compression;
Fig. 3 is a schematic diagram of data compression with the dynamic anchor points introduced by the present invention;
Fig. 4 is a flow chart of the method for delta compression based on dynamic anchor points according to one embodiment of the present invention;
Fig. 5 is a schematic diagram of searching for candidate anchor points with a hash algorithm according to the present invention;
Fig. 6 is a flow chart of the method for delta compression based on dynamic anchor points according to a specific embodiment of the present invention;
Fig. 7 is a block diagram of the device for delta compression based on dynamic anchor points according to one embodiment of the present invention.
The realization of the object, the functional characteristics and the advantages of the present invention are further described below with reference to the embodiments and the drawings.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be understood that terms such as "first" and "second" in the present invention are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only insofar as a person of ordinary skill in the art can implement the combination; when a combination of technical solutions is contradictory or cannot be realized, that combination should be considered not to exist and falls outside the protection scope claimed by the present invention.
Referring to Fig. 1, which is a schematic diagram of text being inserted into target data: the target data stream contains an extra data segment y compared with the reference data stream. Delta compression identifies and encodes this segment, and the rest of the data is stored in compressed form as references into the reference data. At present, all delta compressors compare the incoming target data stream with the reference data stream.
Referring to Fig. 2, which is a schematic diagram of existing data compression: some compressors also compare the incoming target data with previous target data (the target history). The source data (also called the target data) is compared with the reference window and the target window to find matching strings. When the reference window is large enough to hold the entire reference data stream, the optimal compression ratio can be achieved. To save resources, however, the compressor does not compare the entire reference data with the incoming target data. Instead, as in most compression systems, only part of the reference data is held in the reference window for comparison. Consequently, if a data string of the reference data stream is not in the reference window, the string that should have matched is not found and the compression ratio drops sharply. Taking Fig. 1 as an example, if the data segment y is larger than the reference window, the target data after y cannot find matching strings in the reference window. The present invention introduces dynamic anchor points to solve this technical problem. Dynamic anchor points mark the identical content in the reference and target data streams.
Referring to Fig. 3, which is a schematic diagram of data compression with the dynamic anchor points introduced by the present invention: in contrast to the prior art, the dynamic anchor points are found by scanning the reference data stream and the target data stream before string matching. During string matching, the compressor adjusts the reference window pointer according to the dynamic anchor points. If the reference offset is greater than the target offset, some text has been deleted, and the compressor pulls in reference data faster. If the reference offset is less than the target offset, some text has been inserted into the target data, and the compressor stops the reference window. The specific implementation is described in the following embodiments.
Referring to Fig. 4, which is a flow chart of the method for delta compression based on dynamic anchor points according to one embodiment of the present invention: in this embodiment, the method for delta compression based on dynamic anchor points comprises the following steps:
Step S10: scanning the target data stream and the reference data stream with a rolling hash algorithm, and marking a target anchor point and a reference anchor point that have the same rolling hash value as an anchor point pair, wherein the anchor point pair is expressed as (offset relative to the previous target anchor point, offset relative to the previous reference anchor point);
Step S20: using the anchor point pairs to divide the reference data stream and the target data stream into multiple paragraphs;
Step S30: for an unchanged paragraph, recording the extent of the paragraph and encoding it;
Step S40: for a changed paragraph, when the target data stream and the reference data stream flow into a matching module for string matching, automatically aligning the target window or the reference window according to the detected anchor point pair, wherein the target window can hold part of the data of the target data stream and the reference window can hold part of the data of the reference data stream;
Step S41: encoding the result of the string matching; and
Step S50: outputting the encoded data.
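Steps S20 and S30 can be sketched as follows. The text does not say how a paragraph is recognized as unchanged, so this sketch simply compares the bytes of the two paragraphs; that comparison, the function name `split_paragraphs`, and the handling of the data after the last anchor point pair are assumptions made for illustration.

```python
def split_paragraphs(target: bytes, reference: bytes, pairs: list) -> list:
    """Cut both streams at the anchor point pairs and classify each paragraph.

    pairs holds (target_offset, reference_offset) tuples, each offset measured
    from the previous anchor point of the same stream, as in step S10.
    Returns a list of (kind, target_paragraph, reference_paragraph).
    """
    paragraphs = []
    t_pos = r_pos = 0
    for t_off, r_off in pairs:
        t_par = target[t_pos:t_pos + t_off]
        r_par = reference[r_pos:r_pos + r_off]
        # Assumption: a paragraph is "unchanged" when its bytes are identical.
        kind = "unchanged" if t_par == r_par else "changed"
        paragraphs.append((kind, t_par, r_par))
        t_pos += t_off
        r_pos += r_off
    # Whatever follows the last anchor point pair forms a final paragraph.
    t_par, r_par = target[t_pos:], reference[r_pos:]
    if t_par or r_par:
        kind = "unchanged" if t_par == r_par else "changed"
        paragraphs.append((kind, t_par, r_par))
    return paragraphs


reference = b"AAAA|BBBB|CCCC"
target = b"AAAA|BBxxBB|CCCC"
# Toy anchor point pairs placed just after each "|" for readability.
pairs = [(5, 5), (7, 5)]
result = split_paragraphs(target, reference, pairs)
kinds = [kind for kind, _, _ in result]
```

Only the middle paragraph, where two bytes were inserted, would need to go through string matching in step S40.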
In the above embodiment, the target anchor points in the target data stream and the reference anchor points in the reference data stream can be determined with a rolling hash algorithm. A rolling hash is a hash function that computes hash values over the input as it moves through a window. Such a hash function allows the rolling hash to be computed quickly: the new hash value is obtained from the old hash value by removing the old value that leaves the window and adding the new value that enters it. This is similar to a moving-average function and runs much faster than other low-pass filters. The target anchor point and the reference anchor point are marked as an anchor point pair; the target data stream then flows into the target window and the reference data stream flows into the reference window, and the target window or the reference window is automatically aligned according to the anchor point detected in the target data stream or the reference data stream. The result of the string matching is then encoded, and finally the data is output. By intelligently aligning the reference and target data, the solution ensures that the reference and target windows contain the most similar data, and thereby achieves a better compression ratio.
In the technical solution of the present invention, the target data stream and the reference data stream are first scanned with a rolling hash algorithm, and each matched target anchor point and reference anchor point are marked as an anchor point pair. The anchor point pairs divide the target data stream and the reference data stream into multiple paragraphs. For an unchanged paragraph, the extent of the paragraph is recorded and encoded. For a changed paragraph, while the target data stream and the reference data stream are string-matched in streaming fashion, the target window or the reference window is automatically aligned according to whichever anchor point of the pair is detected, and the result of the string matching is then encoded. Finally the encoded data is output. By using a smaller reference window than other tools and by detecting the target data stream and the reference data stream through the dynamic anchor points it sets, the solution reduces computational complexity and improves system performance. In addition, the smaller reference window saves a large amount of on-chip memory, which makes a hardware implementation feasible.
In a specific embodiment, the step of scanning the target data stream and the reference data stream with a rolling hash algorithm and marking a target anchor point and a reference anchor point that have the same rolling hash value as an anchor point pair specifically comprises:
Scanning the target data stream with the rolling hash algorithm to obtain a series of rolling hash values, selecting several bits of each rolling hash value with a bitmask and comparing them with a preset hash feature value, and, if they are equal, recording a target candidate anchor point;
Scanning the reference data stream with the rolling hash algorithm to obtain a series of rolling hash values, selecting several bits of each rolling hash value with a bitmask and comparing them with the preset hash feature value, and, if they are equal, recording a reference candidate anchor point;
Comparing whether the rolling hash values of a reference candidate anchor point and a target candidate anchor point are the same, and, when they are, determining the reference candidate anchor point to be a reference anchor point and the target candidate anchor point to be a target anchor point;
Marking the reference anchor point and the target anchor point as an anchor point pair.
In the above embodiment, the reference candidate anchor points of the reference data stream and the target candidate anchor points of the target data stream can be determined with a rolling hash algorithm. A rolling hash is a hash function that computes hash values over the input as it moves through a window. The flow is shown in Fig. 5, which is a schematic diagram of searching for candidate anchor points with a hash algorithm according to the present invention. Taking the Rabin-Karp algorithm as an example, it typically uses a very simple rolling hash function:
$$H_k = (c_1 a^{k-1} + c_2 a^{k-2} + c_3 a^{k-3} + \dots + c_k a^{0}) \bmod M,$$
where $a$ and $M$ are constants and $c_1, \dots, c_k$ are the input characters.
To avoid computing huge values of $H$, all arithmetic is performed modulo $M$. Removing or adding a character only requires subtracting the first term or adding the last term, and shifting all characters left by one position requires multiplying the remaining sum by $a$. The computation of $H_{k+1}$ can therefore be reduced to:
$$H_{k+1} = ((H_k - c_1 a^{k-1}) \cdot a + c_{k+1}) \bmod M.$$
Thus, as the entire reference data stream is swept, each slide of the hash window produces one rolling hash value. If this rolling hash value matches a predefined feature string (for example, a selected number of least significant bits all equal to "0"), the offset between the most recently shifted-in character and the previous reference anchor point is recorded as a reference candidate anchor point. The matching rolling hash value is also called the fingerprint of the candidate anchor point, and a candidate anchor point is expressed as (anchor offset, anchor fingerprint), where the anchor fingerprint is the rolling hash value.
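The rolling update and the bitmask test described above can be sketched in a few lines. The constants `a = 257` and `M = 2**31 - 1`, the window length, and the function names `rolling_hashes` and `candidate_anchors` are assumptions chosen for illustration; the text does not specify them. For simplicity this sketch measures each offset from the previous candidate anchor point.

```python
def rolling_hashes(data: bytes, k: int, a: int = 257, m: int = (1 << 31) - 1):
    """Yield (index_of_last_byte, H) for every k-byte window of data."""
    if len(data) < k:
        return
    top = pow(a, k - 1, m)  # a^(k-1) mod M, the weight of the oldest byte c_1
    h = 0
    for c in data[:k]:      # H_k = (c_1 a^(k-1) + ... + c_k a^0) mod M
        h = (h * a + c) % m
    yield k - 1, h
    for i in range(k, len(data)):
        # H_{k+1} = ((H_k - c_1 a^(k-1)) * a + c_{k+1}) mod M
        h = ((h - data[i - k] * top) * a + data[i]) % m
        yield i, h


def candidate_anchors(data: bytes, k: int = 16, zero_bits: int = 11) -> list:
    """Return candidate anchor points as (offset, fingerprint) tuples.

    A window is a candidate when the zero_bits least significant bits of its
    rolling hash value are all 0 (the "feature string" of the text).
    """
    mask = (1 << zero_bits) - 1
    anchors = []
    last = 0
    for end, h in rolling_hashes(data, k):
        if h & mask == 0:
            anchors.append((end - last, h))
            last = end
    return anchors
```

Each step costs a constant number of multiplications and additions regardless of the window length, which is why the scan over both streams stays cheap.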
Target candidate anchor points can be determined in the same way. If the fingerprint of a target candidate anchor point is the same as that of a reference candidate anchor point, the target candidate anchor point is confirmed as a target anchor point, and the target anchor point and the reference anchor point are marked as an anchor point pair. In addition, the anchor point density is adjustable. For example, configuring a feature string whose 11 least significant bits are "0" identifies, on average, one anchor point pair per 2 KB; requiring the 10 least significant bits to be "0" gives a density of one per 1 KB. A higher density can improve the compression ratio, but anchor point processing then consumes more resources.
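The pairing rule can be sketched as below. The function name `pair_anchors`, the use of absolute positions as input, and the rule that a reference candidate is only accepted if it lies beyond the previously paired one are assumptions made for illustration; the text only states that equal fingerprints form a pair.

```python
def pair_anchors(target_cands: list, ref_cands: list) -> list:
    """Pair candidate anchor points that share a fingerprint.

    Inputs are lists of (absolute_position, fingerprint). The result is a list
    of anchor point pairs (target_offset, reference_offset), each offset being
    relative to the previous anchor point of the same stream.
    """
    ref_positions = {}
    for pos, fp in ref_cands:
        ref_positions.setdefault(fp, []).append(pos)

    pairs = []
    prev_t = prev_r = 0
    for t_pos, fp in target_cands:
        # Assumption: keep pairs in stream order by only accepting a reference
        # position beyond the last paired one (any position for the first pair).
        r_pos = next(
            (p for p in ref_positions.get(fp, ()) if not pairs or p > prev_r),
            None,
        )
        if r_pos is None:
            continue  # this target candidate has no partner and is dropped
        pairs.append((t_pos - prev_t, r_pos - prev_r))
        prev_t, prev_r = t_pos, r_pos
    return pairs


ref_cands = [(100, 0xA800), (300, 0xB000), (500, 0xC800)]
target_cands = [(100, 0xA800), (350, 0xB000), (420, 0xD000), (550, 0xC800)]
pairs = pair_anchors(target_cands, ref_cands)
```

In this toy input the second pair has a target offset 50 bytes larger than its reference offset, which is the signature of a 50-byte insertion into the target. The density figures quoted above follow from the bitmask: with n low bits required to be zero, roughly one window in 2^n qualifies, so n = 11 gives about one anchor per 2048 bytes and n = 10 about one per 1024 bytes.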
In a specific embodiment, the step of automatically aligning the target window or the reference window according to the detected anchor point pair further comprises:
Determining the order in which the reference anchor point of the reference data stream and the target anchor point of the target data stream are detected;
If the reference anchor point is detected before the target anchor point, that is, the offset of the reference anchor point is less than the offset of the target anchor point, pausing the flow of the reference data stream into the reference window and continuing to search the target data stream flowing into the target window until the paired target anchor point is found, and then resuming string matching between the target data stream and the reference data stream; if the reference anchor point is detected after the target anchor point, that is, the offset of the reference anchor point is greater than the offset of the target anchor point, pausing the flow of the target data stream into the target window and continuing to search the reference data stream flowing into the reference window until the paired reference anchor point is found, and then resuming string matching between the target data stream and the reference data stream.
The above embodiment simplifies the search for matching strings in the target data stream. While the data corresponding to the target data stream is not yet in the reference window, the search continues; once it is in the reference window, the corresponding anchor point pair can be matched. This improves both the compression ratio and the compression efficiency.
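The alignment decision described above reduces to a comparison of the two offsets in an anchor point pair. The sketch below assumes both streams were aligned at the previous anchor point pair and advance one byte at a time; the function name `align_step` and the returned action labels are hypothetical.

```python
def align_step(target_offset: int, reference_offset: int) -> tuple:
    """Decide which stream pauses so that both windows meet at the next pair.

    Offsets are relative to the previous anchor point of each stream.
    Returns (action, bytes_the_other_stream_must_still_advance).
    """
    if reference_offset < target_offset:
        # Reference anchor point is seen first: text was inserted in the target.
        return ("pause_reference", target_offset - reference_offset)
    if reference_offset > target_offset:
        # Target anchor point is seen first: text was deleted from the target.
        return ("pause_target", reference_offset - target_offset)
    return ("aligned", 0)


# Anchor point pairs as (target_offset, reference_offset).
pairs = [(100, 100), (250, 200), (200, 260)]
actions = [align_step(t, r) for t, r in pairs]
```

After each step the two windows start at the same logical content again, so a small reference window is enough even when a long insertion or deletion precedes it.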
In a specific embodiment, the step in which the target data stream and the reference data stream flow into the matching module for string matching specifically includes:
Maximal matching is performed, according to the LZ77 algorithm, between the target data stream and the reference data flowing into the reference window to obtain an output result, where the output result consists of matching units [offset, matching length] and character units. The output result is encoded to obtain the output encoded data, which comprises the output encoded data of paragraphs without changes and the output encoded data of paragraphs with changes. The output encoded data of a paragraph without changes consists of a reference anchor point and a paragraph length; the output encoded data of a paragraph with changes consists of a reference anchor point, matching units, and character units. The reference anchor point is expressed as an offset relative to the previous reference anchor point, and the offset of a matching unit is an offset relative to the current reference anchor point.
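The matching step can be illustrated with a simplified sketch. It searches only the reference window for the longest match at each target position (a full LZ77 coder would also match against previously seen target data), uses a brute-force search, and treats the start of the reference window as the position of the current reference anchor point. The function names, the tuple layout, and the `min_len` threshold are assumptions made for illustration.

```python
def match_against_reference(target: bytes, ref_window: bytes, min_len: int = 3):
    """Emit ("match", offset, length) and ("char", byte) units for `target`."""
    units = []
    i = 0
    while i < len(target):
        best_off, best_len = 0, 0
        # Brute-force search for the longest match starting at target[i].
        for off in range(len(ref_window)):
            n = 0
            while (i + n < len(target) and off + n < len(ref_window)
                   and target[i + n] == ref_window[off + n]):
                n += 1
            if n > best_len:
                best_off, best_len = off, n
        if best_len >= min_len:
            units.append(("match", best_off, best_len))  # matching unit
            i += best_len
        else:
            units.append(("char", target[i]))            # character unit
            i += 1
    return units


def rebuild(units, ref_window: bytes) -> bytes:
    """Inverse operation: reconstruct the target from the units."""
    out = bytearray()
    for unit in units:
        if unit[0] == "match":
            _, off, length = unit
            out += ref_window[off:off + length]
        else:
            out.append(unit[1])
    return bytes(out)


ref = b"the quick brown fox"
tgt = b"the quick red fox"
units = match_against_reference(tgt, ref)
print(units)
print(rebuild(units, ref) == tgt)
```

For this input the unchanged prefix and suffix become two matching units, and the three bytes of "red" are emitted as character units.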
Referring to Fig. 6, Fig. 6 shows a flowchart of the method for delta compression based on dynamic anchor points according to a specific embodiment of the invention. When delta compression starts, the specific steps are as follows:
Step S101: a rolling hash algorithm is applied to the reference file (reference data stream) and the target file (target data stream), and the reference candidate anchor points and the target candidate anchor points are determined according to the anchor point fingerprint;
Step S102: the reference candidate anchor points and the target candidate anchor points are paired; specifically, if the fingerprint of a target candidate anchor point is identical to that of a reference candidate anchor point, the reference candidate anchor point is determined to be a reference anchor point, the target candidate anchor point is determined to be a target anchor point, and this target anchor point and this reference anchor point are marked as an anchor point pair;
Step S103: the reference data stream and the target data stream are input into the reference window and the target window respectively;
Step S104: string matching (LZ77) is performed in the reference window and the target window;
Step S105: during matching, it is judged whether the reference data stream and the target data stream have ended; if so, delta compression ends; if not, detection continues and step S106 is performed;
Step S106: it is judged whether an anchor point is detected in the reference window and the target window; if not, the flow jumps to step S104; if so, step S107 is performed;
Step S107: it is judged whether the anchor point is a reference anchor point or a target anchor point;
Step S110: if the anchor point is a reference anchor point, the reference window is paused and string matching is performed on the target data stream flowing into the target window;
Step S111: it is judged whether the corresponding reference anchor point is found; if so, the flow returns to step S103; if not, the flow returns to step S110;
Step S108: if the anchor point is a target anchor point, the target window is paused and detection continues on the reference data stream flowing into the reference window;
Step S109: it is judged whether the corresponding target anchor point is found; if so, the flow returns to step S103; if not, the flow returns to step S108.
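The pause-and-catch-up behaviour of steps S106 to S111 can be sketched as a small simulation. It assumes, which the patent does not state, that both streams advance in lock step one byte at a time, and that the anchor point pairs are already known as absolute (reference offset, target offset) positions in stream order. The function name `align` and the action labels are hypothetical.

```python
def align(pairs):
    """Return, per anchor pair, which stream is paused and for how many bytes."""
    ref_pos = tgt_pos = 0
    actions = []
    for ref_off, tgt_off in pairs:
        # Both streams flow until the first anchor of the pair is reached.
        step = min(ref_off - ref_pos, tgt_off - tgt_pos)
        ref_pos += step
        tgt_pos += step
        if ref_pos == ref_off and tgt_pos < tgt_off:
            # Reference anchor seen first: hold the reference window and
            # let the target stream run until its paired anchor appears.
            actions.append(("pause_reference", tgt_off - tgt_pos))
            tgt_pos = tgt_off
        elif tgt_pos == tgt_off and ref_pos < ref_off:
            # Target anchor seen first: hold the target window instead.
            actions.append(("pause_target", ref_off - ref_pos))
            ref_pos = ref_off
        else:
            actions.append(("aligned", 0))
    return actions


print(align([(100, 100), (200, 260), (350, 380)]))
```

In the example, 60 bytes inserted into the target before the second anchor cause the reference window to be held for 60 bytes; the third pair shows the opposite case.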
Referring to Fig. 7, Fig. 7 shows a block diagram of the modules of the device for delta compression based on dynamic anchor points according to an embodiment of the invention. In this embodiment, the device for delta compression based on dynamic anchor points includes:
Determining module 10, configured to scan the target data stream and the reference data stream according to a rolling hash algorithm and to mark a target anchor point and a reference anchor point that have the same rolling hash value as an anchor point pair, where the anchor point pair is expressed as (offset relative to the previous target anchor point, offset relative to the previous reference anchor point);
Paragraph division module 20, configured to divide the target data stream and the reference data stream into multiple paragraphs respectively by using the anchor point pairs;
First processing module 30, configured to process paragraphs without changes, including recording the range of the paragraph and performing encoding processing;
Second processing module 40, configured to process paragraphs with changes, including automatically aligning the target window or the reference window according to the detected anchor point pairs when the target data stream and the reference data stream flow into the matching module for string matching, where the target window can hold part of the data of the target data stream and the reference window can hold part of the data of the reference data stream;
Coding module 50, configured to perform encoding processing on the result of the string matching; and
Output module 60, configured to output the encoded data.
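The anchor point pair representation and the paragraph division performed by modules 10 to 30 can be sketched as follows. This is an illustration under stated assumptions: anchor positions are given as absolute (target offset, reference offset) pairs, the first pair is measured from the start of each stream, each anchor offset is treated as a paragraph boundary, and a paragraph counts as unchanged when its bytes are identical in both streams. None of the function names come from the patent.

```python
def to_relative_pairs(abs_pairs):
    """Convert absolute (target, reference) offsets to offsets relative to
    the previous target anchor and the previous reference anchor."""
    relative = []
    prev_t = prev_r = 0
    for t, r in abs_pairs:
        relative.append((t - prev_t, r - prev_r))
        prev_t, prev_r = t, r
    return relative


def split_paragraphs(data: bytes, anchor_offsets):
    """Cut a stream into paragraphs at the given anchor offsets."""
    bounds = [0, *anchor_offsets, len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


def classify(target: bytes, reference: bytes, abs_pairs):
    """Label each paragraph pair as "unchanged" or "changed"."""
    t_pars = split_paragraphs(target, [t for t, _ in abs_pairs])
    r_pars = split_paragraphs(reference, [r for _, r in abs_pairs])
    return ["unchanged" if t == r else "changed"
            for t, r in zip(t_pars, r_pars)]


reference = b"AAAA|BBBB|CCCC"
target = b"AAAA|BxxBBB|CCCC"
pairs = [(5, 5), (12, 10)]
print(to_relative_pairs(pairs))
print(classify(target, reference, pairs))
```

Only the middle paragraph would then need string matching; the first and last could be recorded by their range alone.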
In the above embodiment, the device determines the target anchor points in the target data stream and the reference anchor points in the reference data stream by means of the rolling hash algorithm. A rolling hash is a hash function computed over the input inside a moving window. Some hash functions allow the rolling hash to be computed very quickly: the new hash value is calculated from the old hash value by removing the old value that moves out of the window and adding the new value that moves into the window. This is similar to the way a moving average is computed, and it can be much faster than other low-pass filters. The target anchor points and the reference anchor points are marked as anchor point pairs; the target data stream then flows into the target window and the reference data stream is input into the reference window, and the second processing module 40 automatically aligns the target window or the reference window according to the anchor points detected in the target data stream or the reference data stream. The coding module 50 then performs encoding processing on the result of the string matching, and finally the output module 60 outputs the data. By intelligently aligning the reference and target data, this scheme makes the reference window and the target window contain the most similar data, thereby achieving a better compression ratio.
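The update rule described above (remove the value leaving the window, add the value entering it) can be shown with a polynomial rolling hash. The patent does not name a specific hash function, so the polynomial form and the constants `BASE` and `MOD` below are assumptions chosen for illustration.

```python
BASE = 257
MOD = (1 << 61) - 1


def full_hash(window: bytes) -> int:
    """Hash of a whole window, computed from scratch."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h


def rolling_hashes(data: bytes, width: int):
    """Yield (end_offset, hash) for every window of `width` bytes."""
    if len(data) < width:
        return
    h = full_hash(data[:width])
    top = pow(BASE, width - 1, MOD)  # weight of the oldest byte in the window
    yield width, h
    for i in range(width, len(data)):
        h = (h - data[i - width] * top) % MOD  # remove the byte leaving the window
        h = (h * BASE + data[i]) % MOD         # add the byte entering the window
        yield i + 1, h


data = b"delta compression based on dynamic anchor points"
for end, h in rolling_hashes(data, 8):
    assert h == full_hash(data[end - 8:end])
```

Each update costs a constant number of operations regardless of the window width, whereas recomputing the hash from scratch costs time proportional to the width.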
In a specific embodiment, the determining module 10 is specifically configured to:
Scan the reference data stream using the rolling hash algorithm to obtain a series of rolling hash values, select several data bits of each rolling hash value with a bitmask and compare them with a preset hash feature value, and, if they are equal, record the position as a reference candidate anchor point;
Scan the target data stream using the rolling hash algorithm to obtain a series of rolling hash values, select several data bits of each rolling hash value with a bitmask and compare them with the preset hash feature value, and, if they are equal, record the position as a target candidate anchor point;
Compare whether the rolling hash values of a reference candidate anchor point and a target candidate anchor point are identical, and, when they are identical, determine the reference candidate anchor point to be a reference anchor point and the target candidate anchor point to be a target anchor point;
And mark this reference anchor point and this target anchor point as an anchor point pair.
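The candidate selection and pairing steps above can be sketched in a few lines. The sketch assumes a polynomial rolling hash (the patent does not specify one), keeps only the first occurrence of each fingerprint, and uses hypothetical parameter names `width`, `mask`, and `feature` for the window size, the bitmask, and the preset hash feature value.

```python
def candidates(data: bytes, width: int = 16, mask: int = 0x0F, feature: int = 0):
    """Map fingerprint -> end offset for windows whose masked hash bits
    equal the preset feature value."""
    base, mod = 257, (1 << 61) - 1
    top = pow(base, width - 1, mod)
    found = {}
    h = 0
    for i, byte in enumerate(data):
        if i >= width:
            h = (h - data[i - width] * top) % mod  # drop the oldest byte
        h = (h * base + byte) % mod                # take in the newest byte
        if i + 1 >= width and (h & mask) == feature:
            found.setdefault(h, i + 1)             # keep the first occurrence
    return found


def pair_anchors(reference: bytes, target: bytes, **kwargs):
    """Return sorted (target offset, reference offset) pairs whose
    fingerprints are identical in both streams."""
    ref = candidates(reference, **kwargs)
    tgt = candidates(target, **kwargs)
    return sorted((t_off, ref[fp]) for fp, t_off in tgt.items() if fp in ref)


data = bytes(range(256)) * 8
print(pair_anchors(data, data)[:5])
```

When the two streams are identical, every pair has equal offsets. After an insertion in the target, pairs located past the insertion point would be expected to differ by the inserted length, barring hash collisions.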
In the above embodiment, the reference candidate anchor points of the reference data stream and the target candidate anchor points of the target data stream can be determined by the rolling hash algorithm. For the specific rolling hash algorithm, refer to the embodiments above; the details are not repeated here.
In a specific embodiment, the second processing module 40 is further configured to:
Determine the order of the reference anchor point of the reference data stream and the target anchor point of the target data stream;
If the reference anchor point is detected before the target anchor point, that is, the offset of the reference anchor point is less than the offset of the target anchor point, pause the flow of the reference data stream into the reference window and continue searching the target data stream flowing into the target window until the same target anchor point is found, then continue string matching of the target data stream and the reference data stream. If the reference anchor point is detected after the target anchor point, that is, the offset of the reference anchor point is greater than the offset of the target anchor point, pause the flow of the target data stream into the target window and continue searching the reference data stream flowing into the reference window until the same reference anchor point is found, then continue string matching of the target data stream and the reference data stream.
With the above embodiment, screening of the data strings in the target data stream is simplified: while the target data is not within the reference window, matching continues; once the target data is within the reference window, the corresponding anchor point pair can be matched, which improves both the compression ratio and the compression efficiency.
In a specific embodiment, the second processing module 40 is further configured to perform maximal matching, according to the LZ77 algorithm, between the target data stream and the reference data flowing into the reference window to obtain an output result, where the output result consists of matching units [offset, matching length] and character units. The output result is encoded to obtain the output encoded data. The output encoded data comprises the output encoded data of paragraphs without changes and the output encoded data of paragraphs with changes. The output encoded data of a paragraph without changes consists of a reference anchor point and a paragraph length; the output encoded data of a paragraph with changes consists of a reference anchor point, matching units, and character units. The reference anchor point is expressed as an offset relative to the previous reference anchor point, and the offset of a matching unit is an offset relative to the current reference anchor point.
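One possible in-memory shape for the two kinds of output encoded data, together with a decoder, is sketched below. The record classes and field names are hypothetical; the sketch also assumes that each paragraph starts at its reference anchor point and that the first anchor offset is measured from the start of the reference data. The patent does not define a concrete byte layout, so none is implied here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UnchangedParagraph:
    ref_anchor_delta: int  # offset relative to the previous reference anchor
    length: int            # paragraph length


@dataclass
class ChangedParagraph:
    ref_anchor_delta: int  # offset relative to the previous reference anchor
    units: List[Tuple]     # ("match", offset, length) or ("char", byte)


def decode_records(records, reference: bytes) -> bytes:
    """Rebuild the target data from the records and the reference data."""
    out = bytearray()
    anchor = 0  # absolute position of the current reference anchor
    for rec in records:
        anchor += rec.ref_anchor_delta
        if isinstance(rec, UnchangedParagraph):
            out += reference[anchor:anchor + rec.length]
        else:
            for unit in rec.units:
                if unit[0] == "match":
                    _, off, length = unit  # offset is relative to `anchor`
                    out += reference[anchor + off:anchor + off + length]
                else:
                    out.append(unit[1])
    return bytes(out)


reference = b"HEADER--body-one--TRAILER"
records = [
    UnchangedParagraph(0, 8),
    ChangedParagraph(8, [("match", 0, 5), ("char", ord("t")),
                         ("char", ord("w")), ("char", ord("o")),
                         ("match", 8, 2)]),
    UnchangedParagraph(10, 7),
]
print(decode_records(records, reference))
```

Storing anchors and match offsets as relative values keeps the numbers small, which is presumably why the description expresses them relative to the previous or current reference anchor point.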
The foregoing is merely a preferred embodiment of the invention and does not limit the scope of the invention. Any equivalent structural transformation made under the inventive concept of the invention using the contents of the description and the accompanying drawings, or any direct or indirect application in other related technical fields, is included in the scope of patent protection of the invention.

Claims (10)

  1. A method for delta compression based on dynamic anchor points, characterized in that the method for delta compression based on dynamic anchor points comprises:
    scanning a target data stream and a reference data stream according to a rolling hash algorithm, and marking a target anchor point and a reference anchor point that have the same rolling hash value as an anchor point pair, wherein the anchor point pair is expressed as (offset relative to the previous target anchor point, offset relative to the previous reference anchor point);
    dividing the target data stream and the reference data stream into multiple paragraphs respectively by using the anchor point pairs;
    for a paragraph without changes, recording the range of the paragraph and performing encoding processing;
    for a paragraph with changes, when the target data stream and the reference data stream flow into a matching module for string matching, automatically aligning a target window or a reference window according to the detected anchor point pairs, wherein the target window can hold part of the data of the target data stream and the reference window can hold part of the data of the reference data stream; performing encoding processing on the result of the string matching; and
    outputting the encoded data.
  2. The method for delta compression based on dynamic anchor points according to claim 1, characterized in that the step of scanning the target data stream and the reference data stream according to the rolling hash algorithm and marking a target anchor point and a reference anchor point that have the same rolling hash value as an anchor point pair specifically comprises:
    scanning the reference data stream using the rolling hash algorithm to obtain a series of rolling hash values, selecting several data bits of each rolling hash value with a bitmask and comparing them with a preset hash feature value, and, if they are equal, recording the position as a reference candidate anchor point;
    scanning the target data stream using the rolling hash algorithm to obtain a series of rolling hash values, selecting several data bits of each rolling hash value with a bitmask and comparing them with the preset hash feature value, and, if they are equal, recording the position as a target candidate anchor point;
    comparing whether the rolling hash values of the reference candidate anchor point and the target candidate anchor point are identical, and, when they are identical, determining the reference candidate anchor point to be a reference anchor point and the target candidate anchor point to be a target anchor point;
    marking the reference anchor point and the target anchor point as an anchor point pair.
  3. The method for delta compression based on dynamic anchor points according to claim 1, characterized in that the step of automatically aligning the target window or the reference window according to the detected anchor point pairs further comprises:
    determining the order of the reference anchor point of the reference data stream and the target anchor point of the target data stream;
    if the reference anchor point is detected before the target anchor point, that is, the offset of the reference anchor point is less than the offset of the target anchor point, pausing the flow of the reference data stream into the reference window and continuing to search the target data stream flowing into the target window until the same target anchor point is found, then continuing string matching of the target data stream and the reference data stream; if the reference anchor point is detected after the target anchor point, that is, the offset of the reference anchor point is greater than the offset of the target anchor point, pausing the flow of the target data stream into the target window and continuing to search the reference data stream flowing into the reference window until the same reference anchor point is found, then continuing string matching of the target data stream and the reference data stream.
  4. The method for delta compression based on dynamic anchor points according to claim 1, characterized in that the step in which the target data stream and the reference data stream flow into the matching module for string matching specifically comprises:
    performing maximal matching, according to the LZ77 algorithm, between the target data stream and the reference data flowing into the reference window to obtain an output result, wherein the output result consists of matching units [offset, matching length] and character units.
  5. The method for delta compression based on dynamic anchor points according to claim 4, characterized in that the output encoded data comprises the output encoded data of paragraphs without changes and the output encoded data of paragraphs with changes; the output encoded data of a paragraph without changes consists of a reference anchor point and a paragraph length;
    the output encoded data of a paragraph with changes consists of a reference anchor point, matching units, and character units; wherein the reference anchor point is expressed as an offset relative to the previous reference anchor point, and the offset of a matching unit is an offset relative to the current reference anchor point.
  6. A device for delta compression based on dynamic anchor points, characterized in that the device for delta compression based on dynamic anchor points comprises:
    a determining module, configured to scan a target data stream and a reference data stream according to a rolling hash algorithm and to mark a target anchor point and a reference anchor point that have the same rolling hash value as an anchor point pair, wherein the anchor point pair is expressed as (offset relative to the previous target anchor point, offset relative to the previous reference anchor point);
    a paragraph division module, configured to divide the target data stream and the reference data stream into multiple paragraphs respectively by using the anchor point pairs;
    a first processing module, configured to process paragraphs without changes, including recording the range of the paragraph and performing encoding processing;
    a second processing module, configured to process paragraphs with changes, including automatically aligning a target window or a reference window according to the detected anchor point pairs when the target data stream and the reference data stream flow into a matching module for string matching, wherein the target window can hold part of the data of the target data stream and the reference window can hold part of the data of the reference data stream;
    a coding module, configured to perform encoding processing on the result of the string matching; and
    an output module, configured to output the encoded data.
  7. The device for delta compression based on dynamic anchor points according to claim 6, characterized in that the determining module is specifically configured to:
    scan the reference data stream using the rolling hash algorithm to obtain a series of rolling hash values, select several data bits of each rolling hash value with a bitmask and compare them with a preset hash feature value, and, if they are equal, record the position as a reference candidate anchor point;
    scan the target data stream using the rolling hash algorithm to obtain a series of rolling hash values, select several data bits of each rolling hash value with a bitmask and compare them with the preset hash feature value, and, if they are equal, record the position as a target candidate anchor point;
    compare whether the rolling hash values of the reference candidate anchor point and the target candidate anchor point are identical, and, when they are identical, determine the reference candidate anchor point to be a reference anchor point and the target candidate anchor point to be a target anchor point;
    mark the reference anchor point and the target anchor point as an anchor point pair.
  8. The device for delta compression based on dynamic anchor points according to claim 6, characterized in that the second processing module is specifically configured to:
    determine the order of the reference anchor point of the reference data stream and the target anchor point of the target data stream;
    if the reference anchor point is detected before the target anchor point, that is, the offset of the reference anchor point is less than the offset of the target anchor point, pause the flow of the reference data stream into the reference window and continue searching the target data stream flowing into the target window until the same target anchor point is found, then continue string matching of the target data stream and the reference data stream; if the reference anchor point is detected after the target anchor point, that is, the offset of the reference anchor point is greater than the offset of the target anchor point, pause the flow of the target data stream into the target window and continue searching the reference data stream flowing into the reference window until the same reference anchor point is found, then continue string matching of the target data stream and the reference data stream.
  9. The device for delta compression based on dynamic anchor points according to claim 6, characterized in that the second processing module is further configured to:
    perform maximal matching, according to the LZ77 algorithm, between the target data stream and the reference data flowing into the reference window to obtain an output result, wherein the output result consists of matching units [offset, matching length] and character units.
  10. The device for delta compression based on dynamic anchor points according to claim 9, characterized in that the output encoded data comprises the output encoded data of paragraphs without changes and the output encoded data of paragraphs with changes;
    the output encoded data of a paragraph without changes consists of a reference anchor point and a paragraph length;
    the output encoded data of a paragraph with changes consists of a reference anchor point, matching units, and character units; wherein the reference anchor point is expressed as an offset relative to the previous reference anchor point, and the offset of a matching unit is an offset relative to the current reference anchor point.
CN201810035223.XA 2018-01-15 2018-01-15 The method and device of delta compression based on dynamic anchor point Pending CN108268628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810035223.XA CN108268628A (en) 2018-01-15 2018-01-15 The method and device of delta compression based on dynamic anchor point

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810035223.XA CN108268628A (en) 2018-01-15 2018-01-15 The method and device of delta compression based on dynamic anchor point

Publications (1)

Publication Number Publication Date
CN108268628A true CN108268628A (en) 2018-07-10

Family

ID=62775707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810035223.XA Pending CN108268628A (en) 2018-01-15 2018-01-15 The method and device of delta compression based on dynamic anchor point

Country Status (1)

Country Link
CN (1) CN108268628A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287149A (en) * 2019-05-10 2019-09-27 同济大学 A kind of matching coding method using Hash Search

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050044294A1 (en) * 2003-07-17 2005-02-24 Vo Binh Dao Method and apparatus for window matching in delta compressors
CN101847998A (en) * 2010-04-15 2010-09-29 同济大学 High-performance GML flow compression method
US20120185612A1 (en) * 2011-01-19 2012-07-19 Exar Corporation Apparatus and method of delta compression
CN105515586A (en) * 2015-12-14 2016-04-20 华中科技大学 Rapid delta compression method

Similar Documents

Publication Publication Date Title
Zhang et al. AE: An asymmetric extremum content defined chunking algorithm for fast and bandwidth-efficient data deduplication
US8799239B2 (en) Method, apparatus and computer program product for performing a query using a decision diagram
CN105721340B (en) A kind of online reading pre-load amount calculation method and device
US20170147597A1 (en) Quality score compression for improving downstream genotyping accuracy
CN101044480A (en) Method, device and system for automatic retrieval of similar objects in a network of devices
CN1858734A (en) Data storaging and searching method
US6735600B1 (en) Editing protocol for flexible search engines
CN116915259B (en) Bin allocation data optimized storage method and system based on internet of things
US8117343B2 (en) Landmark chunking of landmarkless regions
CN101459489B (en) Deep packet detection device and method
CN108268628A (en) The method and device of delta compression based on dynamic anchor point
US11755540B2 (en) Chunking method and apparatus
CN116015311A (en) Lz4 text compression method based on sliding dictionary implementation
US7895347B2 (en) Compact encoding of arbitrary length binary objects
CN104123309A (en) Method and system used for data management
US7484068B2 (en) Storage space management methods and systems
CN103607412A (en) Content center multiple-interest-packet processing method based on tree
Kim et al. Design and implementation of binary file similarity evaluation system
CN102722557A (en) Self-adaption identification method for identical data blocks
CN111414339A (en) File processing method, system, device, equipment and medium
CN116821970A (en) Sampling detection method based on block chain and Internet of things
US20100228703A1 (en) Reducing memory required for prediction by partial matching models
US20080114722A1 (en) Method For Low Distortion Embedding Of Edit Distance To Hamming Distance
CN111597379B (en) Audio searching method and device, computer equipment and computer-readable storage medium
WO2001071483A2 (en) Determinaton of a minimum or maximum value in a set of data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180710
