CN101546320B

CN101546320B - Data difference analysis method based on sliding window

Info

Publication number: CN101546320B
Application number: CN2008101028174A
Authority: CN
Inventors: 林兆祥
Original assignee: BEIJING CUZKON TECHNOLOGY DEVELOPMENT Co Ltd
Current assignee: Beijing CUZKON Technology Development Co., Ltd.
Priority date: 2008-03-27
Filing date: 2008-03-27
Publication date: 2011-11-16
Anticipated expiration: 2028-03-27
Also published as: CN101546320A

Abstract

The invention belongs to the field of data compression and more particularly relates to a method for data difference analysis using a sliding window. In many computer applications, being able to analyze the differences between pieces of data helps greatly in reducing data redundancy and improving processing efficiency. The method comprises: dividing the original data into equal-sized data blocks and calculating the hash value of each data block; using a sliding window to determine whether the original data contains a data block identical to the data in the sliding window; if such a data block exists, further determining the matching range around it and recording the match in the differential data; if no such data block exists, moving the sliding window and continuing the comparison; and repeating these operations until the end of the data is reached. With this method, the difference between the original data and the target data can be analyzed rapidly.

Description

A data difference analysis method based on a sliding window

Technical field

The invention belongs to the field of data compression and specifically relates to a method for data difference analysis using a sliding window.

Background technology

In computer systems, communication and storage often involve large amounts of data that differ from one another only slightly. For example, a user may revise a document many times and save it as a different file after each revision. These files differ very little from one another, yet the computer system must keep a copy of each file, which wastes a large amount of storage space. If such files are transmitted over a network, each transmission carries data that differs only slightly from what was sent before, which likewise wastes network bandwidth.

If the differing parts of different data can be separated out and only those parts processed, the processing efficiency of the computer improves substantially. For example, in a system that stores files on a remote server, each time the user modifies a file on the client the whole file must be transferred to the server again, so the data of the whole file has to travel over the network. If the difference between the data before and after the modification can be analyzed, only the modified part needs to be transferred to the server. In general, the differing part accounts for only a very small proportion of the file, so a large amount of network bandwidth is saved.

For convenience of explanation, the source data involved in the processing is called the original data below; the data that needs to be analyzed is called the target data; and the data describing the difference between the target data and the original data is called the differential data.

Traditional methods usually divide both the original data and the target data into equal-sized data blocks and then search the two for data blocks with identical content. The accuracy of this kind of analysis is low. Taking a data block size of 2 as an example, with original data abcdef and target data kabcde, the partitioning result is ab|cd|ef (original data) and ka|bc|de (target data). Clearly, with this partitioning there is no identical data block in the original data and the target data, although the two in fact share a large amount of identical data (abcde).
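The failure of fixed partitioning in this example can be reproduced in a few lines. The sketch below is illustrative only; the helper name `split_blocks` is an assumption and does not come from the patent text.

```python
def split_blocks(data: str, size: int) -> list[str]:
    """Cut data into consecutive blocks of the given size."""
    return [data[i:i + size] for i in range(0, len(data), size)]


original_blocks = split_blocks("abcdef", 2)  # blocks of the original data
target_blocks = split_blocks("kabcde", 2)    # blocks of the target data

# Blocks that appear in both partitions; empty here because the single
# inserted character "k" shifts every block boundary in the target data.
common_blocks = set(original_blocks) & set(target_blocks)

print(original_blocks, target_blocks, common_blocks)
```

Although the two strings share the run abcde, the block-by-block comparison finds nothing in common, which is the weakness the sliding window is meant to address.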

Summary of the invention

The purpose of the invention is to provide a technique for analyzing the difference between different data quickly and effectively, thereby reducing data redundancy and improving the efficiency of computers in storage, transmission and similar tasks.

To achieve this purpose, the technical solution adopted by the invention is a data difference analysis method based on a sliding window, applied to the field of data compression, comprising the following steps:

1) divide the original data into equal-sized data sub-blocks;

2) calculate the hash value of each data sub-block in the original data;

3) set the current processing position to the start position of the target data;

4) if the size of the data between the current processing position and the end position of the target data is less than the size of a data sub-block of the original data, go to step 10);

5) starting from the current processing position, take a data block of the same size as a data sub-block of the original data as the data window;

6) determine the matching range of the original data and the target data according to the data window;

7) if no matching range is found, set the current processing position to the position following the previous current processing position, and go to step 4);

8) write the data match result to the differential data;

9) set the current processing position to the position following the matching range, and go to step 4);

10) write the match result for the remaining data to the differential data.
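The ten steps above can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the patented implementation: SHA-256 stands in for the unspecified hash function, the data window itself is taken as the matching range, the first identical sub-block is used, a trailing partial sub-block of the original data is ignored, and the function name `compute_diff` and the `("literal", ...)` / `("copy", ...)` tuples are hypothetical.

```python
import hashlib


def compute_diff(original: bytes, target: bytes, block_size: int) -> list[tuple]:
    # Steps 1-2: split the original data into equal-sized sub-blocks and
    # index their start offsets by hash value.
    index: dict[bytes, list[int]] = {}
    for start in range(0, len(original) - block_size + 1, block_size):
        block = original[start:start + block_size]
        index.setdefault(hashlib.sha256(block).digest(), []).append(start)

    ops: list[tuple] = []
    unmatched_start = 0  # start of target data not yet written to the result
    pos = 0              # Step 3: current processing position
    while len(target) - pos >= block_size:          # Step 4
        window = target[pos:pos + block_size]       # Step 5
        # Step 6: sub-blocks with an equal hash value and identical content.
        candidates = [s for s in index.get(hashlib.sha256(window).digest(), [])
                      if original[s:s + block_size] == window]
        if not candidates:                          # Step 7: slide by one
            pos += 1
            continue
        # Step 8: record pending unmatched data, then the match itself.
        if unmatched_start < pos:
            ops.append(("literal", target[unmatched_start:pos]))
        ops.append(("copy", candidates[0], block_size))
        pos += block_size                           # Step 9
        unmatched_start = pos
    if unmatched_start < len(target):               # Step 10
        ops.append(("literal", target[unmatched_start:]))
    return ops


ops = compute_diff(b"abcdef", b"kabcde", 2)
print(ops)
```

On the abcdef / kabcde example from the background section, this sketch finds the sub-blocks ab and cd despite the shifted boundaries; only k and the trailing e are recorded as unmatched data.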

In step 6) above, the matching range of the original data and the target data is determined according to the data window in the following detailed steps:

2a). among the data sub-blocks of the original data, look for data sub-blocks whose hash value equals the hash value of the data window;

2b). if the hash function is not strongly collision-resistant, further look, among the data sub-blocks with an equal hash value, for data sub-blocks whose content is identical to the content of the data window;

2c). for each data sub-block whose content is identical to the content of the data window, a matching range can be determined; select a suitable matching range and return it; if there is no suitable matching range, no matching range has been found, and return.

In step 2c) above, a matching range can be determined for each data sub-block whose content is identical to the content of the data window. The matching range can be determined by either of two methods:

3a). use the range of the data window directly as the matching range;

3b). also include in the matching range the data around the data window and the data sub-block whose content is identical at corresponding positions and that has not yet been recorded in the differential data.

For selecting a suitable matching range in 2c) above: if there are multiple matching ranges, only one of them needs to be selected. Many selection strategies are possible without affecting the essence of the invention; they include, but are not limited to:

4a). select the first matching range;

4b). select the matching range with the largest extent;

4c). select the first matching range whose extent is not less than a predetermined value.

The effect of the invention is that the method described here can find the differing and identical parts of different data quickly and accurately, thereby reducing data redundancy.

Description of drawings

Fig. 1 is the flowchart for computing the data difference;

Fig. 2 is an example of differential data;

Fig. 3 is the flowchart for finding the matching range of the target data and the original data;

Fig. 4 is an example of original data and target data;

Fig. 5 shows the original data divided into data blocks of size 4;

Fig. 6 shows the hash value calculated for each data block;

Fig. 7 shows the sliding window being used to locate a matching block in the original data;

Fig. 8 shows the matching range around the position of the matching block.

Embodiment

The invention is further described below by way of example with reference to the accompanying drawings.

As shown in Fig. 1, the flowchart for computing the data difference illustrates the following steps:

1) divide the original data into equal-sized data sub-blocks;

2) calculate the hash value of each data sub-block in the original data;

4) if the size of the remaining data is less than the size of a data sub-block of the original data, go to step 10);

6) determine the matching range of the original data and the target data according to the data window;

8) write the data match result to the differential data;

10) write the match result for the remaining data to the differential data.

As shown in Fig. 2, Fig. 2 is an example of writing the data match result to the differential data. Many methods of writing matched data to the differential data are possible without affecting the essence of the invention; only one is given here as an example. There are two kinds of match result. One is a part for which identical data can be found in the original data, such as the marked part of the data in the example figure. The other is a part for which no match can be found in the original data, such as kp and ml in the example figure. The recording method used in the example figure is as follows: 0xff indicates that the data that follows found no match; 0xff is immediately followed by the data length, and the data length is followed by a copy of the unmatched data. 0x00 indicates that the data that follows found a match; 0x00 is immediately followed by the length of the match, which is followed by the position of the matching data in the original data. The details are as follows:

2a). kp in the target data is a piece of data of length 2 for which no identical data is found in the original data. This part is therefore data for which no match was found in the original data. Such data is written to the differential data as follows: write 0xff, then the data length 2, then the data itself, kp. The resulting content of the differential data is 0xff|2|kp|

2b). The marked part of the target data is a piece of data of length 5 for which identical data can be found in the original data (it is identical to the data of length 5 starting at the second position of the original data). This part is therefore data for which identical data was found in the original data. Such data is written to the differential data as follows: write 0x00, then the data length 5, then the position 1 at which the identical data is located in the original data (the position of the first datum is 0, the position of the second is 1, and so on). The resulting content of the differential data is 0x00|5|1|.

2c). ml in the target data is handled in the same way as in 2a); the result written is 0xff|2|ml|

2d). In summary, the resulting differential data is as shown in Fig. 2.
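The record layout described in 2a) to 2d) can be illustrated with a small encoder and decoder. This is a sketch under assumptions that the text does not state: the length and position fields are taken to be one byte each (so values are limited to 255), and the original data `abcdefgh` is invented for illustration because Fig. 2 is not reproduced here. The function names and the tuple form of the operations are hypothetical.

```python
def encode(ops: list[tuple]) -> bytes:
    """Serialize match results using the 0xff / 0x00 markers."""
    out = bytearray()
    for op in ops:
        if op[0] == "literal":            # data with no match in the original
            data = op[1]
            out += bytes([0xFF, len(data)]) + data
        else:                             # ("copy", position, length)
            _, position, length = op
            out += bytes([0x00, length, position])
    return bytes(out)


def decode(delta: bytes, original: bytes) -> bytes:
    """Rebuild the target data from the differential data and the original."""
    out = bytearray()
    i = 0
    while i < len(delta):
        marker, length = delta[i], delta[i + 1]
        if marker == 0xFF:                # copy of unmatched data follows
            out += delta[i + 2:i + 2 + length]
            i += 2 + length
        else:                             # position in the original follows
            position = delta[i + 2]
            out += original[position:position + length]
            i += 3
    return bytes(out)


original = b"abcdefgh"  # assumed original data, for illustration only
ops = [("literal", b"kp"), ("copy", 1, 5), ("literal", b"ml")]
delta = encode(ops)
rebuilt = decode(delta, original)
print(delta, rebuilt)
```

The encoded bytes correspond to 0xff|2|kp|0x00|5|1|0xff|2|ml|, and decoding them against the original data yields the target data again. A practical format would need wider length and position fields.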

As shown in Fig. 3, Fig. 3 is the flow for finding the matching range of the target data and the original data. It describes in more detail how step 6) in the explanation of Fig. 1 finds the matching range of the original data and the target data according to the data window:

3a). among the data sub-blocks of the original data, look for data sub-blocks whose hash value equals the hash value of the data window;

3b). among the data sub-blocks with an equal hash value, further look for data sub-blocks whose content is identical to the content of the data window. This step is needed only when the hash function is not strongly collision-resistant. A hash function that is not strongly collision-resistant is one for which equal hash values indicate only that the data content may be identical, without guaranteeing in practice that it is. Therefore, when the hash function is not strongly collision-resistant, it is necessary to check further whether the data content is identical.

3c). select a suitable matching range and return it. If there is no suitable matching range, no matching range has been found; return.

In the steps above, if a matching range is found, the data in the target data between this matching range and the previously found matching range is data for which no match can be found in the original data.
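The content check in 3b) matters most with a cheap hash function. The sketch below deliberately uses a very weak hash (the byte sum modulo 256) so that a collision is easy to show; the hash, the function names and the sample data are all assumptions made for illustration.

```python
def weak_hash(data: bytes) -> int:
    """A deliberately weak hash: different contents often collide."""
    return sum(data) % 256


def build_index(original: bytes, block_size: int) -> dict[int, list[int]]:
    """Map each hash value to the start offsets of the sub-blocks having it."""
    index: dict[int, list[int]] = {}
    for start in range(0, len(original) - block_size + 1, block_size):
        block = original[start:start + block_size]
        index.setdefault(weak_hash(block), []).append(start)
    return index


def find_identical_blocks(original: bytes, index: dict[int, list[int]],
                          window: bytes) -> list[int]:
    # 3a) sub-blocks whose hash value equals that of the data window
    same_hash = index.get(weak_hash(window), [])
    # 3b) keep only those whose content is really identical
    return [s for s in same_hash if original[s:s + len(window)] == window]


original = b"adxx"
index = build_index(original, 2)

collision = weak_hash(b"ad") == weak_hash(b"bc")          # equal hash values
false_hit = find_identical_blocks(original, index, b"bc")  # rejected by 3b)
true_hit = find_identical_blocks(original, index, b"ad")
print(collision, false_hit, true_hit)
```

Here `ad` and `bc` share a hash value, so without the byte-for-byte comparison the window `bc` would be wrongly reported as present in the original data.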

For selecting a suitable matching range in 3c) above, the details are as follows: for a data window in the target data, the original data may contain multiple data sub-blocks identical to the data of this window, so multiple matching ranges may exist, and it suffices to select one of them as the result. Many selection strategies are possible without affecting the essence of the invention, for example:

4a). select the first matching range, that is, return the first matching range found during the computation;

4b). select the matching range with the largest extent, that is, return the one with the greatest data length among all matching ranges;

4c). select the first matching range whose extent is not less than a predetermined value, for example a matching range containing no fewer than 16 characters of data.
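The three selection strategies can be written as small functions over a list of candidate matching ranges. In this sketch a matching range is represented as a tuple `(original_start, target_start, length)`; that representation, the function names and the sample values are assumptions for illustration.

```python
from typing import Optional

Range = tuple[int, int, int]  # (original_start, target_start, length)


def select_first(ranges: list[Range]) -> Optional[Range]:
    """4a) the first matching range found."""
    return ranges[0] if ranges else None


def select_longest(ranges: list[Range]) -> Optional[Range]:
    """4b) the matching range with the greatest data length."""
    return max(ranges, key=lambda r: r[2], default=None)


def select_first_at_least(ranges: list[Range], minimum: int) -> Optional[Range]:
    """4c) the first matching range whose length is not below a threshold."""
    return next((r for r in ranges if r[2] >= minimum), None)


candidates = [(0, 5, 4), (8, 3, 20), (16, 5, 17)]
print(select_first(candidates))
print(select_longest(candidates))
print(select_first_at_least(candidates, 16))
```

Each function returns `None` when no range qualifies, which corresponds to the case where no matching range is found.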

For determining the matching ranges mentioned in 3c) above, the details are as follows: for each data sub-block whose content is identical to the content of the data window, a matching range can be determined, by either of the following methods:

5a). use the range of the data window directly as the matching range;

5b). also include in the matching range the data around the data window and the data sub-block whose content is identical at corresponding positions and that has not yet been recorded in the differential data.
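Method 5b) amounts to growing the match outward from the data window in both directions. The following is a minimal sketch: the function name `extend_match`, the parameter `recorded_end` (the target position up to which data has already been recorded in the differential data) and the sample data are assumptions, since Fig. 4 is not reproduced here.

```python
def extend_match(original: bytes, target: bytes, original_start: int,
                 target_start: int, size: int,
                 recorded_end: int = 0) -> tuple[int, int, int]:
    """Grow a window-sized match and return
    (original_start, target_start, length) of the extended range."""
    o_lo, t_lo = original_start, target_start
    # Extend backward while the preceding bytes agree and the target byte
    # has not already been recorded in the differential data.
    while o_lo > 0 and t_lo > recorded_end and original[o_lo - 1] == target[t_lo - 1]:
        o_lo -= 1
        t_lo -= 1
    o_hi, t_hi = original_start + size, target_start + size
    # Extend forward while the following bytes agree.
    while o_hi < len(original) and t_hi < len(target) and original[o_hi] == target[t_hi]:
        o_hi += 1
        t_hi += 1
    return o_lo, t_lo, t_hi - t_lo


original = b"ABCDEFGHIJKL"  # assumed data, for illustration only
target = b"KKBCDEFGHIJ"

# The window EFGH sits at target offset 5 and original offset 4.
extended = extend_match(original, target, 4, 5, 4)
limited = extend_match(original, target, 4, 5, 4, recorded_end=4)
print(extended, limited)
```

With method 5a) the matching range would be just the four bytes EFGH; with this extension it grows to the nine bytes BCDEFGHIJ, so fewer bytes end up recorded as unmatched data.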

As shown in Fig. 4, Fig. 4 shows the content of the original data and the target data used in the example.

As shown in Fig. 5, the original data is divided into data blocks of size blocksize (blocksize is 4 in the example).

As shown in Fig. 6, the hash value of each data block in the original data is calculated.

As shown in Fig. 7, the sliding window is used to locate matching data blocks between the target data and the original data, in the following steps:

7a). starting from the start position of the target data, take a data block of size 4, KKBC; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;

7b). move the data window in the target data and take the next data block, KBCD; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;

7c). move the data window in the target data and take the next data block, BCDE; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;

7d). move the data window in the target data and take the next data block, CDEF; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;

7e). move the data window in the target data and take the next data block, DEFG; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;

7f). move the data window in the target data and take the next data block, EFGH; calculate its hash value and use it to look for a matching data block in the original data; data block 2 in the original data is found to match this data block, so the match succeeds.

As shown in Fig. 8, the matching range around the matching block is determined as follows:

8a). first method, that is, the method described in 5a) above: use the range of the data window directly as the matching range, that is, the data sub-block in the original data that is identical to the data window of the target data. The correspondingly marked part of the data in the figure is the matching range obtained with this method;

8b). second method, that is, the method described in 5b) above: also include in the matching range the data around the data window and the data sub-block whose content is identical at corresponding positions and that has not yet been recorded in the differential data. The correspondingly marked part in the figure is the matching range obtained with this method.

Claims

1. A data difference analysis method based on a sliding window, applied to the field of data compression, characterized by comprising the following steps:

1) divide the original data into equal-sized data sub-blocks;

2) calculate the hash value of each data sub-block in the original data;

8) write the data match result to the differential data;

10) write the match result for the remaining data to the differential data.

2. The data difference analysis method based on a sliding window according to claim 1, characterized in that in step 6) the matching range of the original data and the target data is determined according to the data window in the following steps:

3. The data difference analysis method based on a sliding window according to claim 2, characterized in that in step 2c) a matching range can be determined for each data sub-block whose content is identical to the content of the data window, by either of two methods:

3a). use the range of the data window directly as the matching range;

4. The data difference analysis method based on a sliding window according to claim 2, characterized in that in step 2c) a suitable matching range is selected, wherein the strategy for selecting the suitable matching range is:

4a). select the first matching range; or

4b). select the matching range with the largest extent; or