CN101546320B - Data difference analysis method based on sliding window - Google Patents

Info

Publication number
CN101546320B
Authority
CN
China
Prior art keywords
data
block
matching range
window
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008101028174A
Other languages
Chinese (zh)
Other versions
CN101546320A (en)
Inventor
林兆祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing CUZKON Technology Development Co., Ltd.
Original Assignee
BEIJING CUZKON TECHNOLOGY DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING CUZKON TECHNOLOGY DEVELOPMENT Co Ltd
Priority to CN2008101028174A
Publication of CN101546320A
Application granted
Publication of CN101546320B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention belongs to the field of data compression and relates in particular to a method of analyzing data differences using a sliding window. In many computer applications, being able to analyze the differences between different data helps greatly in reducing data redundancy and improving processing efficiency. The method comprises: dividing the original data into equal-sized data blocks and calculating the hash value of each data block; sliding a window over the target data to determine whether a data block of the original data is identical to the window data; if such a data block exists, further locating the matching range around the identical data block and recording the match in the difference data; if no such data block exists, moving the sliding window and continuing the comparison; and repeating these operations until the end of the data. With this method, the differences between the original data and the target data can be analyzed rapidly.

Description

A data difference analysis method based on a sliding window
Technical field
The invention belongs to the field of data compression and relates in particular to a method of analyzing data differences using a sliding window.
Background art
In computer systems, large amounts of data that differ only slightly from one another frequently arise during communication and storage. For example, a user may revise a document many times and save it as a different file at each stage of revision. The differences between these files are very small, yet the computer system must keep a copy of each file, which wastes a large amount of storage space. If such files are transmitted over a network, each transmission carries data that differs only slightly from what was sent before, which likewise wastes network bandwidth.
If the differing parts of the data can be separated out so that only those parts are processed, the processing efficiency of the computer improves substantially. Consider, for example, a system that stores files on a remote server. Each time the user modifies a file on the client, the whole file must be transferred to the server again, so the data of the entire file travels over the network. If the differences between the data before and after the modification can be identified, only the modified part needs to be transferred to the server. In general the differing part accounts for only a small proportion of the file, so a large amount of network bandwidth is saved.
For convenience of explanation, the source data involved in the processing is hereinafter called the original data; the data to be analyzed is called the target data; and the data describing the differences between the target data and the original data is called the difference data.
In the traditional method, the original data and the target data are both divided into equal-sized data blocks, and blocks with identical content are then searched for in the original data and the target data. The accuracy of this method is low. Taking a block size of 2 as an example, with original data abcdef and target data kabcde, the blocking result is ab|cd|ef (original data) and ka|bc|de (target data). With this partitioning, the original data and the target data obviously share no identical block, although the two actually have a large amount of data in common (abcde).
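The failure of fixed-block comparison on this example can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name `fixed_blocks` is an assumption and does not appear in the patent.

```python
def fixed_blocks(data: str, size: int) -> list[str]:
    """Split data into consecutive blocks of `size` characters."""
    return [data[i:i + size] for i in range(0, len(data), size)]


original_blocks = fixed_blocks("abcdef", 2)  # ['ab', 'cd', 'ef']
target_blocks = fixed_blocks("kabcde", 2)    # ['ka', 'bc', 'de']

# Blocks present in both partitions: none, although "abcde" is common to both.
shared = set(original_blocks) & set(target_blocks)
print(shared)
```

Inserting a single character at the front shifts every block boundary in the target, which is why no block survives the comparison.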
Summary of the invention
The object of the invention is to provide a technique that analyzes the differences between different data quickly and effectively, thereby reducing data redundancy and improving the efficiency of computers in areas such as storage and transmission.
To achieve this object, the technical solution adopted by the invention is a data difference analysis method based on a sliding window, applied in the field of data compression, comprising the following steps:
1) divide the original data into equal-sized data sub-blocks;
2) calculate the hash value of each data sub-block of the original data;
3) set the current processing position to the start of the target data;
4) if the amount of data between the current processing position and the end of the target data is less than the size of a data sub-block of the original data, go to step 10);
5) starting at the current processing position, take a data block of the same size as a data sub-block of the original data as the data window;
6) determine the matching range of the original data and the target data according to the data window;
7) if no matching range is found, set the current processing position to the position following the previous current processing position and go to step 4);
8) write the data matching result to the difference data;
9) set the current processing position to the position following the matching range and go to step 4);
10) write the matching result for the remaining data to the difference data.
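Steps 1) to 10) can be sketched in Python as follows. This is a minimal illustration under several assumptions that the patent leaves open: Python's built-in `hash` stands in for the hash function, the data window itself is used as the matching range, the first matching sub-block is chosen, and the difference data is represented as a list of hypothetical `("literal", data)` and `("copy", offset, length)` records.

```python
def sliding_window_diff(original: bytes, target: bytes, block_size: int):
    """Return difference records describing `target` in terms of `original`."""
    # Steps 1-2: index the original by the hash of each full-size sub-block.
    index = {}
    for off in range(0, len(original) - block_size + 1, block_size):
        block = original[off:off + block_size]
        index.setdefault(hash(block), []).append(off)

    records = []
    pos = 0            # step 3: current processing position in the target
    literal_start = 0  # start of target data not yet written to the records
    while len(target) - pos >= block_size:       # step 4
        window = target[pos:pos + block_size]    # step 5
        # Step 6: hash lookup, then a content check in case of collisions.
        match = next((off for off in index.get(hash(window), [])
                      if original[off:off + block_size] == window), None)
        if match is None:                        # step 7: slide by one position
            pos += 1
            continue
        if literal_start < pos:                  # step 8: unmatched data first,
            records.append(("literal", target[literal_start:pos]))
        records.append(("copy", match, block_size))  # then the match itself
        pos += block_size                        # step 9
        literal_start = pos
    if literal_start < len(target):              # step 10
        records.append(("literal", target[literal_start:]))
    return records
```

Applied to the background example (`sliding_window_diff(b"abcdef", b"kabcde", 2)`), the sketch reports the literal `k`, copies of `ab` and `cd` from the original, and the literal `e`, whereas fixed-block comparison finds nothing in common.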
In step 6) above, the matching range of the original data and the target data is determined according to the data window in the following detailed steps:
2a). among the data sub-blocks of the original data, find those whose hash value equals the hash value of the data window;
2b). if the hash function is not strongly collision-resistant, further find, among the data sub-blocks with equal hash values, those whose data content is identical to the data content of the data window;
2c). for each data sub-block whose data content is identical to that of the data window, a matching range can be determined; select a suitable matching range and return it; if no suitable matching range exists, return with no matching range found.
In step 2c) above, a matching range can be determined for each data sub-block whose data content is identical to that of the data window. The matching range can be determined in two ways:
3a). use the extent of the data window directly as the matching range;
3b). additionally include in the matching range the data surrounding the data window and the data sub-block whose content at corresponding positions is identical and which has not yet been recorded in the difference data.
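Method 3b) amounts to growing the match outward in both directions. The sketch below is one possible reading, not the patent's own code; the function name `extend_match` and the parameter `t_floor` (the first target position not yet recorded in the difference data) are assumptions introduced for illustration.

```python
def extend_match(original: bytes, target: bytes, o_start: int, t_start: int,
                 length: int, t_floor: int = 0):
    """Grow a match of `length` bytes at original[o_start:] and target[t_start:].

    The match never extends to the left of `t_floor` in the target.
    Returns (o_start, t_start, length) of the extended range.
    """
    # Extend backwards while the preceding bytes agree.
    while (o_start > 0 and t_start > t_floor
           and original[o_start - 1] == target[t_start - 1]):
        o_start -= 1
        t_start -= 1
        length += 1
    # Extend forwards while the following bytes agree.
    while (o_start + length < len(original)
           and t_start + length < len(target)
           and original[o_start + length] == target[t_start + length]):
        length += 1
    return o_start, t_start, length
```

For example, with the assumed inputs `original = b"ABCDEFGHIJKL"` and `target = b"KKBCDEFGHIXX"`, a window match on `EFGH` (original offset 4, target offset 5) grows to `BCDEFGHI`, that is, `(1, 2, 8)`.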
In selecting a suitable matching range in step 2c) above, if several matching ranges exist, only one of them needs to be selected. Various selection strategies are possible without affecting the essence of the invention; they include but are not limited to:
4a). select the first matching range;
4b). select the matching range with the largest extent;
4c). select the first matching range whose extent is not less than a predetermined value.
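The three selection strategies can be written as small interchangeable functions. In this sketch a matching range is assumed to be a pair `(offset in the original data, length)`; that representation and the function names are illustrative assumptions only.

```python
from typing import Optional

Range = tuple[int, int]  # (offset in the original data, length)


def select_first(candidates: list[Range]) -> Optional[Range]:
    """Strategy 4a: the first matching range found."""
    return candidates[0] if candidates else None


def select_largest(candidates: list[Range]) -> Optional[Range]:
    """Strategy 4b: the matching range with the greatest length."""
    return max(candidates, key=lambda r: r[1], default=None)


def select_first_at_least(candidates: list[Range],
                          min_length: int) -> Optional[Range]:
    """Strategy 4c: the first range whose length is at least `min_length`."""
    return next((r for r in candidates if r[1] >= min_length), None)
```

Each function returns `None` when no suitable range exists, which corresponds to returning with no matching range found. Strategy 4a is the cheapest, 4b tends to produce the smallest difference data, and 4c is a compromise that stops searching once a range is long enough.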
The effect of the invention is that the method described here finds the differing and identical parts of different data quickly and accurately, thereby reducing data redundancy.
Description of drawings
Fig. 1 is the flow for computing data differences;
Fig. 2 is an example diagram of difference data;
Fig. 3 is the flow for finding the matching range of the target data and the original data;
Fig. 4 is an example of original data and target data;
Fig. 5 shows the original data divided into data blocks of size 4;
Fig. 6 shows the hash value calculated for each data block;
Fig. 7 shows the sliding window being used to locate the matching block in the original data;
Fig. 8 shows the matching range at the position of the matching block.
Embodiment
The invention is further described below by way of example with reference to the accompanying drawings.
Fig. 1 is a flow chart of computing data differences and illustrates the steps involved:
1) divide the original data into equal-sized data sub-blocks;
2) calculate the hash value of each data sub-block of the original data;
3) set the current processing position to the start of the target data;
4) if the size of the remaining data is less than the size of a data sub-block of the original data, go to step 10);
5) starting at the current processing position, take a data block of the same size as a data sub-block of the original data as the data window;
6) determine the matching range of the original data and the target data according to the data window;
7) if no matching range is found, set the current processing position to the position following the previous current processing position and go to step 4);
8) write the data matching result to the difference data;
9) set the current processing position to the position following the matching range and go to step 4);
10) write the matching result for the remaining data to the difference data.
Fig. 2 is an example of writing the data matching result to the difference data. The matching result can be written to the difference data in many ways without affecting the essence of the invention; only one is given here as an example. There are two kinds of matching result. In the first, identical data can be found in the original data, as with the marked data portion in the example figure (rendered in the source text as inline images Figure GSB00000512743800041 and Figure GSB00000512743800042). In the second, no match can be found in the original data, as with kp and ml in the example figure. The recording method in the example figure is as follows: 0xff indicates that the data that follows found no match; 0xff is followed by the data length, and the data length is followed by a copy of the unmatched data. 0x00 indicates that the data that follows found a match; 0x00 is followed by the length of the match and then by the position of the matching data in the original data. In detail:
2a). kp in the target data is data of length 2 for which no identical data is found in the original data. This part is therefore data for which no match can be found in the original data. It is written to the difference data as follows: write 0xff, then the data length 2, then the data kp itself. The resulting difference data is 0xff|2|kp|
2b). The marked portion of the target data (rendered in the source text as inline image Figure GSB00000512743800043) is data of length 5 for which identical data can be found in the original data (it is identical to the 5 data items starting at the second position of the original data). This part is therefore data for which a match can be found in the original data. It is written to the difference data as follows: write 0x00, then the data length 5, then the position 1 of the identical data in the original data (the position of the first data item is 0, that of the second is 1, and so on). The resulting difference data is 0x00|5|1|
2c). ml in the target data is handled like 2a); the result written is 0xff|2|ml|
2d). In summary, the resulting difference data is as shown in Fig. 2.
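The record layout described for Fig. 2 can be sketched as an encoder and a matching decoder. The patent does not state field widths, so this sketch assumes one byte each for the length and position fields (values up to 255). The original data `abcdefg` and the matched text `bcdef` are also assumptions, chosen only so that a 5-item match starts at position 1; the actual content of the figure is not recoverable from the text.

```python
def encode(records) -> bytes:
    """Serialize ("literal", data) and ("copy", position, length) records."""
    out = bytearray()
    for rec in records:
        if rec[0] == "literal":
            data = rec[1]
            out += bytes([0xFF, len(data)]) + data  # 0xff | length | data
        else:
            _, position, length = rec
            out += bytes([0x00, length, position])  # 0x00 | length | position
    return bytes(out)


def decode(diff: bytes, original: bytes) -> bytes:
    """Rebuild the target data from the difference data and the original."""
    out = bytearray()
    i = 0
    while i < len(diff):
        tag, length = diff[i], diff[i + 1]
        if tag == 0xFF:  # unmatched data is stored inline
            out += diff[i + 2:i + 2 + length]
            i += 2 + length
        else:            # matched data is copied from the original
            position = diff[i + 2]
            out += original[position:position + length]
            i += 3
    return bytes(out)


original = b"abcdefg"  # assumed example data
records = [("literal", b"kp"), ("copy", 1, 5), ("literal", b"ml")]
diff = encode(records)  # bytes for 0xff|2|kp|0x00|5|1|0xff|2|ml|
```

Decoding `diff` against `original` yields `kpbcdefml`, so the eleven bytes of difference data are sufficient to reconstruct the target when the original is available.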
Fig. 3 shows the flow for finding the matching range of the target data and the original data. It details step 6) of the Fig. 1 description, in which the matching range of the original data and the target data is determined according to the data window:
3a). among the data sub-blocks of the original data, find those whose hash value equals the hash value of the data window;
3b). among the data sub-blocks with equal hash values, further find those whose data content is identical to the data content of the data window. This step is needed only when the hash function is not strongly collision-resistant. A hash function that is not strongly collision-resistant is one for which equal hash values indicate only that the data content may be identical; in practice they do not guarantee it. When such a hash function is used, it is therefore necessary to check further whether the data content is identical;
3c). select a suitable matching range and return it. If no suitable matching range exists, return with no matching range found.
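The need for the content check in step 3b) is easy to see with a deliberately weak hash. In the sketch below, `weak_hash` is a simple byte sum chosen for illustration (the patent does not prescribe a hash function), and `build_index` and `find_blocks` are hypothetical helper names.

```python
def weak_hash(block: bytes) -> int:
    """Byte-sum checksum: blocks holding the same bytes in any order collide."""
    return sum(block) % 65536


def build_index(original: bytes, block_size: int) -> dict[int, list[int]]:
    """Map each hash value to the offsets of the sub-blocks that produce it."""
    index: dict[int, list[int]] = {}
    for off in range(0, len(original) - block_size + 1, block_size):
        block = original[off:off + block_size]
        index.setdefault(weak_hash(block), []).append(off)
    return index


def find_blocks(original: bytes, index: dict[int, list[int]],
                window: bytes) -> list[int]:
    """Return offsets of sub-blocks whose content equals the window."""
    candidates = index.get(weak_hash(window), [])      # step 3a: equal hash
    return [off for off in candidates                  # step 3b: equal content
            if original[off:off + len(window)] == window]


original = b"ABCDDCBA"
index = build_index(original, 4)
```

Here the sub-blocks `ABCD` and `DCBA` share one hash value, so a lookup for the window `DCBA` yields two candidates by hash, and only the content comparison narrows them to offset 4.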
In the above steps, if a matching range is found, the data in the target data lying between this matching range and the previously found matching range is data for which no match can be found in the original data.
The selection of a suitable matching range in 3c) above is detailed as follows. For a data window in the target data, the original data may contain several data sub-blocks identical to the data of that window, so several matching ranges may exist; it suffices to select one of them as the result. Various selection strategies are possible without affecting the essence of the invention, for example:
4a). select the first matching range, that is, return the first matching range found during computation;
4b). select the matching range with the largest extent, that is, return the one with the greatest data length among all matching ranges;
4c). select the first matching range whose extent is not less than a predetermined value, for example a matching range containing no fewer than 16 characters.
The matching ranges from which 3c) above selects are determined as follows. For each data sub-block whose data content is identical to that of the data window, a matching range can be determined by either of these methods:
5a). use the extent of the data window directly as the matching range;
5b). additionally include in the matching range the data surrounding the data window and the data sub-block whose content at corresponding positions is identical and which has not yet been recorded in the difference data.
Fig. 4 shows the content of the original data and the target data used in the example.
As shown in Fig. 5, the original data is divided into data blocks of size blocksize (blocksize is 4 in the example).
As shown in Fig. 6, the hash value of each data block of the original data is calculated.
As shown in Fig. 7, the sliding window is used to locate matching data blocks between the target data and the original data. The concrete steps are as follows:
7a). starting from the beginning of the target data, take a data block of size 4, KKBC; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;
7b). move the data window in the target data and take the next data block, KBCD; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;
7c). move the data window in the target data and take the next data block, BCDE; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;
7d). move the data window in the target data and take the next data block, CDEF; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;
7e). move the data window in the target data and take the next data block, DEFG; calculate its hash value and use it to look for a matching data block in the original data; no identical data block is found, so the match fails;
7f). move the data window in the target data and take the next data block, EFGH; calculate its hash value and use it to look for a matching data block in the original data; this data block matches data block 2 of the original data, so the match succeeds.
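The window sequence in steps 7a) to 7f) can be traced with a short sketch. The content of Fig. 4 is not available in the text, so the original data `ABCDEFGHIJKL` and the target data `KKBCDEFGH` used here are assumptions chosen to be consistent with the windows listed above; the function name `first_match` is likewise hypothetical. A Python dictionary stands in for the table of block hash values.

```python
def first_match(original: str, target: str, block_size: int):
    """Slide a window over `target` one position at a time.

    Returns the list of windows tried and, for the first match, the pair
    (target position, 1-based block number in the original), or None.
    """
    blocks = {original[i:i + block_size]: i // block_size + 1
              for i in range(0, len(original) - block_size + 1, block_size)}
    trace = []
    for pos in range(len(target) - block_size + 1):
        window = target[pos:pos + block_size]
        trace.append(window)
        if window in blocks:  # dict lookup hashes the window, then compares content
            return trace, (pos, blocks[window])
    return trace, None


trace, hit = first_match("ABCDEFGHIJKL", "KKBCDEFGH", 4)
```

With these inputs, `trace` lists the six windows KKBC, KBCD, BCDE, CDEF, DEFG and EFGH, and `hit` reports target position 5 matching block 2. The windows BCDE, CDEF and DEFG fail even though their characters occur in the original, because only aligned sub-blocks of the original are indexed.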
As shown in Fig. 8, the matching range around the matching block is determined as follows:
8a). first method, the method described in 5a) above: use the extent of the data window directly as the matching range, that is, the data sub-block of the original data that is identical to the data window of the target data. The marked portion of the data in the figure (rendered in the source text as inline image Figure GSB00000512743800061) is the matching range obtained by this method;
8b). second method, the method described in 5b) above: additionally include in the matching range the data surrounding the data window and the data sub-block whose content at corresponding positions is identical and which has not yet been recorded in the difference data. The marked portion in the figure (rendered in the source text as inline image Figure GSB00000512743800062) is the matching range obtained by this method.

Claims (4)

1. A data difference analysis method based on a sliding window, applied in the field of data compression, characterized by comprising the following steps:
1) divide the original data into equal-sized data sub-blocks;
2) calculate the hash value of each data sub-block of the original data;
3) set the current processing position to the start of the target data;
4) if the amount of data between the current processing position and the end of the target data is less than the size of a data sub-block of the original data, go to step 10);
5) starting at the current processing position, take a data block of the same size as a data sub-block of the original data as the data window;
6) determine the matching range of the original data and the target data according to the data window;
7) if no matching range is found, set the current processing position to the position following the previous current processing position and go to step 4);
8) write the data matching result to the difference data;
9) set the current processing position to the position following the matching range and go to step 4);
10) write the matching result for the remaining data to the difference data.
2. The data difference analysis method based on a sliding window according to claim 1, characterized in that step 6) determines the matching range of the original data and the target data according to the data window in the following steps:
2a). among the data sub-blocks of the original data, find those whose hash value equals the hash value of the data window;
2b). if the hash function is not strongly collision-resistant, further find, among the data sub-blocks with equal hash values, those whose data content is identical to the data content of the data window;
2c). for each data sub-block whose data content is identical to that of the data window, a matching range can be determined; select a suitable matching range and return it; if no suitable matching range exists, return with no matching range found.
3. The data difference analysis method based on a sliding window according to claim 2, characterized in that in step 2c) a matching range can be determined for each data sub-block whose data content is identical to that of the data window, the matching range being determined by one of two methods:
3a). use the extent of the data window directly as the matching range;
3b). additionally include in the matching range the data surrounding the data window and the data sub-block whose content at corresponding positions is identical and which has not yet been recorded in the difference data.
4. The data difference analysis method based on a sliding window according to claim 2, characterized in that in step 2c) a suitable matching range is selected, the strategy for selecting a suitable matching range being:
4a). select the first matching range; or
4b). select the matching range with the largest extent; or
4c). select the first matching range whose extent is not less than a predetermined value.
CN2008101028174A 2008-03-27 2008-03-27 Data difference analysis method based on sliding window Expired - Fee Related CN101546320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101028174A CN101546320B (en) 2008-03-27 2008-03-27 Data difference analysis method based on sliding window

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008101028174A CN101546320B (en) 2008-03-27 2008-03-27 Data difference analysis method based on sliding window

Publications (2)

Publication Number Publication Date
CN101546320A CN101546320A (en) 2009-09-30
CN101546320B true CN101546320B (en) 2011-11-16

Family

ID=41193460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101028174A Expired - Fee Related CN101546320B (en) 2008-03-27 2008-03-27 Data difference analysis method based on sliding window

Country Status (1)

Country Link
CN (1) CN101546320B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2475327A (en) * 2009-11-16 2011-05-18 Alexander Jackson-Smith Processing binary data arranged into segments or blocks using a value based on the binary ones in the segments to transform part of the segment.
CN101706825B (en) * 2009-12-10 2011-04-20 华中科技大学 Replicated data deleting method based on file content types
CN102467571A (en) * 2010-11-17 2012-05-23 英业达股份有限公司 Data block partition method and addition method for data de-duplication
CN102214210B (en) * 2011-05-16 2013-03-13 华为数字技术(成都)有限公司 Method, device and system for processing repeating data
CN102541991B (en) * 2011-11-14 2014-12-24 广东威创视讯科技股份有限公司 Method and system for file processing
US9195666B2 (en) * 2012-01-17 2015-11-24 Apple Inc. Location independent files
CN102682086B (en) * 2012-04-23 2014-11-05 华为技术有限公司 Data segmentation method and data segmentation equipment
CN106850842A (en) * 2012-06-28 2017-06-13 北京奇虎科技有限公司 A kind of download of file, method for uploading and device
CN103617215B (en) * 2013-11-20 2017-02-08 上海爱数信息技术股份有限公司 Method for generating multi-version files by aid of data difference algorithm
CN105095473B (en) * 2015-08-11 2018-12-18 北京思特奇信息技术股份有限公司 The method and system that a kind of pair of variance data is analyzed
CN108769973B (en) * 2018-07-19 2021-04-02 深圳全志在线有限公司 Privacy protection method of Bluetooth equipment
CN108990055B (en) * 2018-07-19 2021-07-30 深圳全志在线有限公司 Privacy protection circuit of bluetooth equipment
CN110083743B (en) * 2019-03-28 2021-11-16 哈尔滨工业大学(深圳) Rapid similar data detection method based on unified sampling
CN110362343A (en) * 2019-07-19 2019-10-22 上海交通大学 The method of the detection bytecode similarity of N-Gram
CN111598177A (en) * 2020-05-19 2020-08-28 中国科学院空天信息创新研究院 Self-adaptive maximum sliding window matching method facing low-overlapping image matching

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1801197A (en) * 2005-01-07 2006-07-12 环隆电气股份有限公司 Difference data integration and comparison method
CN101067792A (en) * 2006-05-04 2007-11-07 国际商业机器公司 System and method for scalable processing of multi-way data stream correlations

Also Published As

Publication number Publication date
CN101546320A (en) 2009-09-30

Similar Documents

Publication Publication Date Title
CN101546320B (en) Data difference analysis method based on sliding window
Woodruff et al. Subspace embeddings and $\ell_p$-regression using exponential random variables
Xie et al. Online cross-modal hashing for web image retrieval
CN102622366B (en) Similar picture identification method and similar picture identification device
Qin et al. Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors
EP2659377B1 (en) Adaptive index for data deduplication
CN104820717B (en) A kind of storage of mass small documents and management method and system
CN101887457B (en) Content-based copy image detection method
US20170161641A1 (en) Streamlined analytic model training and scoring system
CN101561813B (en) Method for analyzing similarity of character string under Web environment
WO2008083046A2 (en) Data segmentation using shift-varying predicate function fingerprinting
US10949424B2 (en) Optimization technique for database application
CN104978351A (en) Backup method of mass small files and cloud store gateway
CN103440301B (en) A kind of data multi-duplicate hybrid storage method and system
CN103649946A (en) Transmitting filesystem changes over a network
CN107506260A (en) A kind of dynamic division database incremental backup method
CN104361068B (en) Parallel method of partition and system during a kind of data deduplication
CN109408681A (en) A kind of character string matching method, device, equipment and readable storage medium storing program for executing
CN101398837B (en) Method for rapidly matching sms text
CN104346401A (en) Method and device for message forwarding between components in cloud management platform
CN101911058A (en) Generation of a representative data string
CN111415196A (en) Advertisement recall method, device, server and storage medium
Kim et al. Design and implementation of binary file similarity evaluation system
CN110083743B (en) Rapid similar data detection method based on unified sampling
Wu et al. A feature-based intelligent deduplication compression system with extreme resemblance detection

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: BEIJING XINGYU ZHONGKE TECHNOLOGY DEVELOPMENT CO.,

Free format text: FORMER OWNER: LIN ZHAOXIANG

Effective date: 20110719

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100080 ROOM 1607, NO. 1, HAIDIAN SOUTH ROAD, HAIDIAN DISTRICT, BEIJING TO: 100101 A2201, TEAM CENTER, DATUN ROAD, CHAOYANG DISTRICT, BEIJING

TA01 Transfer of patent application right

Effective date of registration: 20110719

Address after: 100101 Beijing city Chaoyang District Datun Road Theo center A2201

Applicant after: Beijing CUZKON Technology Development Co., Ltd.

Address before: 100080, room 1, 1607 South Haidian Road, Beijing, Haidian District

Applicant before: Lin Zhaoxiang

DD01 Delivery of document by public notice

Addressee: Lin Zhaoxiang

Document name: Notification of Passing Examination on Formalities

C14 Grant of patent or utility model
GR01 Patent grant
DD01 Delivery of document by public notice

Addressee: Beijing CUZKON Technology Development Co., Ltd.

Document name: Notification to Pay the Fees

DD01 Delivery of document by public notice

Addressee: Beijing CUZKON Technology Development Co., Ltd.

Document name: Notification of Passing Examination on Formalities

DD01 Delivery of document by public notice

Addressee: Beijing CUZKON Technology Development Co., Ltd.

Document name: Notification of Termination of Patent Right

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20111116

Termination date: 20140327

EXPY Termination of patent right or utility model