CN106648991A - Duplicated data deletion method in data recovery system - Google Patents

Duplicated data deletion method in data recovery system

Info

Publication number
CN106648991A
CN106648991A
Authority
CN
China
Prior art keywords
section
sliding window
chunk
file
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611235476.9A
Other languages
Chinese (zh)
Inventor
祁晖
底晓强
李锦青
宋小龙
毕琳
蒋振刚
杨华民
从立钢
任维武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology
Priority to CN201611235476.9A
Publication of CN106648991A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • G06F11/1453Management of the data involved in backup or backup restore using de-duplication of the data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a duplicated data deletion (deduplication) method for a data disaster tolerance system, relates to the field of data storage, and solves the problems that existing deduplication methods are inefficient and occupy large amounts of memory. The method reduces the number of fingerprints in the fingerprint library by sampling, and uses the hash value of each sample as its fingerprint instead of computing fingerprints with a fingerprint generating algorithm such as MD5, so the fingerprint library is generated and searched efficiently. The method takes the sliding window with the minimum hash value as the sample; when part of a file is changed, the sample positions may not change, i.e., the hash values of the samples remain the minimum, so the method is insensitive to local updates of the file. Compared with the enhanced position-aware sampling method, the samples obtained by the method have higher similarity. Sample similarity and deduplication ratio were tested for different numbers of samples, and the results show that the method achieves both higher similarity and a higher deduplication ratio.

Description

Data deduplication method in a data disaster tolerance system
Technical field
The present invention relates to the field of data storage and to a data deduplication method, and in particular to a data deduplication method in a data disaster tolerance system.
Background technology
Data is the oil of the information age; in the big-data era especially, ensuring the security and availability of data is a highly important task. A data disaster tolerance system backs data up at a remote site so that important data is not lost when a disaster or a human error occurs. However, continually backed-up data is highly redundant, which not only wastes storage space but also degrades storage performance and increases storage cost. Data deduplication technology can eliminate the redundant data in a disaster tolerance system. The general deduplication procedure is: when a file is added to the storage system, the file is first divided into chunks; the fingerprint of each chunk is then computed with a fingerprint generating algorithm and compared with the chunk fingerprints already in the storage system; if a fingerprint does not yet exist, the corresponding data chunk is stored; otherwise only a pointer to the existing chunk needs to be recorded, saving storage space.
Traditional deduplication methods must store all chunk fingerprints. Although this yields a high deduplication ratio (the ratio of the amount of data deleted to the total amount of data before deduplication; for example, reducing a 10 GB backup to 3 GB stored gives a ratio of 70%), it occupies a large amount of memory and requires multiple disk accesses, so deduplication is inefficient. Similarity detection techniques sample the fingerprints to reduce the number of fingerprint lookups during deduplication and thereby improve its efficiency.
The currently representative sampling methods are sparse index sampling, minimum-value sampling, and enhanced position-aware sampling. Sparse index sampling selects as samples the chunks whose hash value has its first n bits equal to 0, i.e., on average one chunk in 2^n is selected (for example, with n = 6 roughly one chunk in 64). This saves memory, but when the file content changes the sample positions can shift, which affects the computed sample similarity (the similarity is the ratio of the intersection of two sample sets to the total number of samples; the larger this value, the more likely a high deduplication ratio is obtained). Minimum-value sampling selects the k chunks with the smallest hash values among all chunks of the file as samples; this improves sample similarity to some extent, but its computational overhead grows rapidly with file size, so it is unsuitable for large files. Enhanced position-aware sampling samples at fixed offsets from the head and the tail of the file, which reduces the impact of content changes on sample similarity to some extent, but its deduplication ratio is not high. A short sketch of the first two rules follows.
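For illustration, a minimal Python sketch of the two hash-based prior-art sampling rules just described; the 64-bit hash width and the function names are assumptions for the example, not anything fixed by the patent:

```python
import heapq

HASH_BITS = 64  # assumed width of the chunk hash values

def sparse_index_samples(chunk_hashes, n):
    """Sparse index sampling: keep chunks whose hash begins with n zero
    bits, so on average one chunk in 2**n becomes a sample."""
    return [h for h in chunk_hashes if h >> (HASH_BITS - n) == 0]

def min_value_samples(chunk_hashes, k):
    """Minimum-value sampling: the k smallest chunk hashes of the file."""
    return heapq.nsmallest(k, chunk_hashes)
```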
Summary of the invention
To solve the problems that existing data deduplication methods are inefficient and occupy a large amount of memory, the present invention provides a data deduplication method in a data disaster tolerance system.
In the data deduplication method in the data disaster tolerance system, sampling is introduced into the data chunking process so that sampling and chunking are completed in a single pass; the hash value of each sample serves as its fingerprint. The specific sampling procedure is as follows (a sketch of the index table follows the steps):
Step one: first divide the file into segments, the number of segments being equal to the number of samples;
Step two: establish a sampling index table whose keyword is the segment number and whose value is the sample hash value of the segment together with the set of boundary information of all chunks in the segment;
Step three: set the value of the variable min to infinity;
Step four: compute the hash value hv of the sliding window;
Step five: judge whether the hash value hv of step four is less than min and, if so, assign hv to the variable min; then execute step six;
Step six: judge whether the sliding window has reached a chunk boundary; if so, store the chunk boundary information into the chunk-boundary information set of the corresponding segment and execute step seven; if not, slide the window one byte to the right and return to step four;
Step seven: judge whether the sliding window has passed a segment boundary; if so, store the value of min as the sample hash value of the corresponding segment and execute step eight; if not, slide the window the distance of one window to the right and return to step four;
Step eight: judge whether the end of the file has been reached; if so, terminate; if not, slide the window the distance of one window to the right and return to step three;
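A minimal sketch of the sampling index table of steps one and two, assuming a plain Python dict; the layout and field names are illustrative, since the patent does not prescribe a concrete data structure:

```python
def new_sampling_index(n_samples):
    """One entry per segment, keyed by segment number (steps one and two).
    'sample_hash' receives the minimum sliding-window hash of the segment
    (step seven); 'boundaries' collects the (left, right) byte offsets of
    every chunk that ends inside the segment (step six)."""
    return {seg_no: {"sample_hash": None, "boundaries": []}
            for seg_no in range(n_samples)}
```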
The sampling index tables obtained by the above steps can be used to deduplicate two files; the specific procedure is as follows (a sketch follows the steps):
Step a: traverse the sampling index table of file A and obtain from it the sample hash value of segment Ai;
Step b: look up whether the hash value of step a is in the sampling index table of file B; if so, obtain the segment Bj of file B corresponding to that hash value and execute step c; if not, return to step a;
Step c: compute the fingerprint of each chunk in segment Ai and segment Bj with a fingerprint generating algorithm;
Step d: traverse the fingerprints in segment Ai;
Step e: judge whether a given fingerprint exists in segment Bj; if so, replace the chunk corresponding to the fingerprint with the address of the corresponding chunk in Bj; if not, return to step d;
Step f: judge whether all fingerprints in segment Ai have been checked; if so, return to step a; if not, return to step d.
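A hedged sketch of steps a through f, reusing the index-table layout above. The MD5 fingerprints follow the text; returning the duplicate chunks as a list of address pairs is an assumption, since the patent only says a duplicate chunk is replaced by the address of the matching chunk:

```python
import hashlib

def md5_fingerprint(data, lo, hi):
    """Step c: full fingerprint of the chunk data[lo:hi]."""
    return hashlib.md5(data[lo:hi]).hexdigest()

def deduplicate(index_a, data_a, index_b, data_b):
    """Steps a-f: for each segment Ai of file A whose sample hash also
    appears in file B's index, compare chunk fingerprints and report the
    chunks of A that can be replaced by an address into B."""
    duplicates = []
    for entry_a in index_a.values():                        # step a
        if entry_a["sample_hash"] is None:
            continue
        entry_b = next((e for e in index_b.values()         # step b
                        if e["sample_hash"] == entry_a["sample_hash"]), None)
        if entry_b is None:
            continue
        # step c: fingerprints are computed only for the matched segments
        fp_b = {md5_fingerprint(data_b, lo, hi): (lo, hi)
                for lo, hi in entry_b["boundaries"]}
        for lo, hi in entry_a["boundaries"]:                # steps d-f
            if (fp := md5_fingerprint(data_a, lo, hi)) in fp_b:  # step e
                duplicates.append(((lo, hi), fp_b[fp]))
    return duplicates
```

Note that this is where the method saves work: full fingerprints are computed only for segment pairs whose cheap sample hashes already match.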
Beneficial effects of the present invention: the method reduces the number of fingerprints in the fingerprint library by sampling, thereby saving storage space and improving fingerprint retrieval efficiency. At the same time, the invention uses the hash value of a sample as its fingerprint rather than computing it with a fingerprint generating algorithm such as MD5, so the fingerprint library is generated and searched more efficiently.
The invention selects the sliding window possessing the minimum hash value as the sample. When part of a file is changed, the sample positions may not change, i.e., the hash values of the samples remain the minimum, so this approach is insensitive to local updates of the file.
Compared with the enhanced position-aware sampling method, the samples obtained by the method have higher similarity. Sample similarity was measured experimentally with 8 and 16 samples, and the results show that the invention obtains higher similarity.
Description of the drawings
Fig. 1 is the flow chart of the sampling process of the data deduplication method in the data disaster tolerance system of the present invention;
Fig. 2 is the deduplication flow chart of the data deduplication method in the data disaster tolerance system of the present invention;
Fig. 3 compares the sample similarity of the data deduplication method in the data disaster tolerance system of the present invention with that of the existing enhanced position-aware sampling method when the number of samples is 8;
Fig. 4 compares the sample similarity of the data deduplication method in the data disaster tolerance system of the present invention with that of the existing enhanced position-aware sampling method when the number of samples is 16;
Fig. 5 compares the deduplication ratio of the data deduplication method in the data disaster tolerance system of the present invention with that of the existing enhanced position-aware sampling method when the number of samples is 8;
Fig. 6 compares the deduplication ratio of the data deduplication method in the data disaster tolerance system of the present invention with that of the existing enhanced position-aware sampling method when the number of samples is 16.
Specific embodiments
Embodiment one. This embodiment is described with reference to Fig. 1 to Fig. 6. In the data deduplication method in the data disaster tolerance system, sampling is introduced into the data chunking process so that sampling and chunking are completed in a single pass, and the hash value of each sample is used as its fingerprint.
The minimum-value sampling method based on variable-length chunking builds on CDC (content-defined chunking); the chunking method specifically comprises the following steps:
Step 1: set a sliding window of fixed size, D and r being predefined values;
Step 2: set the start of the file as the left boundary of the chunk, and align the left boundary of the sliding window with the left boundary of the chunk;
Step 3: if the right boundary of the sliding window reaches or exceeds the end of the file, set the end of the file as the right boundary of the chunk and terminate chunking;
Step 4: compute the hash value hv of the sliding window; if hv modulo D equals r, or the right boundary of the sliding window has reached the maximum right boundary of a chunk, set the right boundary of the sliding window as the right boundary of the chunk, slide the window the distance of one window to the right, and repeat from step 3; otherwise, slide the window one byte to the right and repeat from step 3.
In this embodiment, D and r may be any two integers greater than 0 as long as r is less than D; in the experiments D is 4096 and r is 13. In general, the value of D is less than or equal to the size of the sliding window, and r is a prime number. A sketch of this chunking follows.
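A self-contained sketch of the CDC chunking of steps 1 to 4. The window hash here is a plain byte sum standing in for the unspecified window hash, and the window size W and the maximum chunk size are assumed values; only D = 4096 and r = 13 come from the text:

```python
W = 48                  # sliding-window size in bytes (assumed)
D, r = 4096, 13         # values used in the experiments described above
MAX_CHUNK = 65536       # "maximum right boundary" of a chunk (assumed)

def window_hash(data, pos):
    """Stand-in for the unspecified sliding-window hash."""
    return sum(data[pos:pos + W])

def cdc_chunks(data):
    """Steps 1-4: return the (left, right) boundary of every chunk."""
    chunks, left, pos = [], 0, 0
    while pos + W < len(data):
        hv = window_hash(data, pos)                  # step 4
        if hv % D == r or (pos + W) - left >= MAX_CHUNK:
            chunks.append((left, pos + W))           # window's right edge ends the chunk
            left = pos + W
            pos += W                                 # slide a full window width
        else:
            pos += 1                                 # slide one byte
    chunks.append((left, len(data)))                 # step 3: EOF closes the last chunk
    return chunks
```

Because a boundary is declared whenever hv % D == r, chunk sizes average roughly D bytes for a well-distributed hash, which is why D controls the chunking granularity.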
This embodiment performs sampling within the CDC chunking process. With reference to Fig. 1, the detailed process is as follows (a sketch building on the previous ones follows the steps):
Step 1: first divide the file into segments, the number of segments being equal to the number of samples;
Step 2: establish a sampling index table whose keyword is the segment number and whose value is the sample hash value of the segment together with the set of boundary information of all chunks in the segment;
Step 3: set the value of the variable min to infinity;
Step 4: compute the hash value of the sliding window and assign it to the variable hv; compare hv with min and, if hv is less than min, assign hv to min;
Step 5: if the sliding window reaches a chunk boundary, store the chunk boundary information into the chunk-boundary information set of the corresponding segment;
Step 6: if the sliding window passes a segment boundary, store the value of min as the sample hash value of the corresponding segment, then execute step 3 again, until the sliding window reaches the end of the file.
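Building on the two sketches above (reusing W, D, r, MAX_CHUNK, window_hash and new_sampling_index), a hedged sketch of steps 1 to 6: the minimum window hash of each segment is tracked inside the same loop that finds chunk boundaries, so sampling and chunking finish in a single pass over the file. Where the text leaves details open (e.g. assigning a chunk to the segment its boundary falls in), the choices below are assumptions:

```python
def build_sampling_index(data, n_samples=8):
    """Steps 1-6: chunk the file and sample each segment in one pass."""
    seg_len = max(W, len(data) // n_samples)          # step 1: segments
    index = new_sampling_index(n_samples)             # step 2: index table
    left, pos = 0, 0
    cur_min = float("inf")                            # step 3
    while pos + W < len(data):
        hv = window_hash(data, pos)                   # step 4: window hash
        if hv < cur_min:
            cur_min = hv
        if hv % D == r or (pos + W) - left >= MAX_CHUNK:
            seg = min(pos // seg_len, n_samples - 1)
            index[seg]["boundaries"].append((left, pos + W))  # step 5
            left = pos + W
            if pos + W >= (seg + 1) * seg_len:        # step 6: segment done
                index[seg]["sample_hash"] = cur_min
                cur_min = float("inf")                # restart at step 3
            pos += W                                  # slide a full window
        else:
            pos += 1                                  # slide one byte
    seg = min(pos // seg_len, n_samples - 1)          # EOF: close the last chunk
    index[seg]["boundaries"].append((left, len(data)))
    if index[seg]["sample_hash"] is None:
        index[seg]["sample_hash"] = cur_min
    return index
```

Feeding two such tables to the deduplicate sketch above, e.g. deduplicate(build_sampling_index(a), a, build_sampling_index(b), b), reproduces the end-to-end flow of Fig. 1 and Fig. 2.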
This embodiment is described with reference to Fig. 2. Deduplication is carried out after embodiment one and specifically comprises the following steps (the two files to be deduplicated are denoted A and B):
Step 1: traverse the sampling index table of file A and obtain from it the sample hash value of segment Ai;
Step 2: look up whether the hash value of step 1 is in the sampling index table of file B; if it is, obtain the segment Bj of file B corresponding to that value and execute step 3; otherwise continue with step 1;
Step 3: compute the fingerprint of each chunk in segment Ai and segment Bj with a fingerprint generating algorithm such as MD5;
Step 4: traverse each fingerprint in segment Ai; if the fingerprint exists in segment Bj, replace the chunk corresponding to the fingerprint with the address of the corresponding chunk in Bj;
Step 5: after all fingerprints in segment Ai have been checked, continue with step 1.
To improve sampling speed, this embodiment does not compute chunk fingerprints with an algorithm such as MD5 during CDC chunking; fingerprint computation is postponed to the deduplication stage. The enhanced position-aware sampling method, by contrast, must perform CDC chunking and compute the chunk fingerprints in the deduplication stage. Therefore, when deduplicating highly similar files, such as the periodic full database backups of a disaster tolerance system, the total deduplication time of the present invention is roughly on par with that of the enhanced position-aware sampling method.
This embodiment is described with reference to Fig. 3 to Fig. 6; the method described in this embodiment obtains a higher deduplication ratio. The deduplication ratio was tested experimentally with 8 and 16 samples, and the results show that the invention obtains a higher deduplication ratio: it improved by 21.5% and 20.3% when N is 8 and 16, respectively.

Claims (2)

1. A data deduplication method in a data disaster tolerance system, in which sampling is introduced into the data chunking process so that sampling and chunking are completed in a single pass, and the hash value of each sample is used as its fingerprint, characterized in that the specific sampling procedure is:
Step one: first divide the file into segments, the number of segments being equal to the number of samples;
Step two: establish a sampling index table whose keyword is the segment number and whose value is the sample hash value of the segment together with the set of boundary information of all chunks in the segment;
Step three: set the value of the variable min to infinity;
Step four: compute the hash value hv of the sliding window;
Step five: judge whether the hash value hv of step four is less than min and, if so, assign hv to the variable min; then execute step six;
Step six: judge whether the sliding window has reached a chunk boundary; if so, store the chunk boundary information into the chunk-boundary information set of the corresponding segment and execute step seven; if not, slide the window one byte to the right and return to step four;
Step seven: judge whether the sliding window has passed a segment boundary; if so, store the value of min as the sample hash value of the corresponding segment and execute step eight; if not, slide the window the distance of one window to the right and return to step four;
Step eight: judge whether the end of the file has been reached; if so, terminate; if not, slide the window the distance of one window to the right and return to step three;
The sampling index tables obtained by the above steps can be used to deduplicate two files; the specific procedure is:
Step a: traverse the sampling index table of file A and obtain from it the sample hash value of segment Ai;
Step b: look up whether the hash value of step a is in the sampling index table of file B; if so, obtain the segment Bj of file B corresponding to that hash value and execute step c; if not, return to step a;
Step c: compute the fingerprint of each chunk in segment Ai and segment Bj with a fingerprint generating algorithm;
Step d: traverse the fingerprints in segment Ai;
Step e: judge whether a given fingerprint exists in segment Bj; if so, replace the chunk corresponding to the fingerprint with the address of the corresponding chunk in Bj; if not, return to step d;
Step f: judge whether all fingerprints in segment Ai have been checked; if so, return to step a; if not, return to step d.
2. The data deduplication method in a data disaster tolerance system according to claim 1, characterized in that a CDC chunking method is adopted, the specific procedure being:
Step 1: set a sliding window of fixed size, D and r being predefined values; D and r are integers greater than 0;
Step 2: set the start of the file as the left boundary of the chunk, and align the left boundary of the sliding window with the left boundary of the chunk;
Step 3: if the right boundary of the sliding window reaches or exceeds the end of the file, set the end of the file as the right boundary of the chunk and terminate chunking;
Step 4: compute the hash value hv of the sliding window; if hv modulo D equals r, or the right boundary of the sliding window has reached the maximum right boundary of a chunk, set the right boundary of the sliding window as the right boundary of the chunk, slide the window the distance of one window to the right, and repeat from step 3; otherwise, slide the window one byte to the right and repeat from step 3.
CN201611235476.9A 2016-12-28 2016-12-28 Duplicated data deletion method in data recovery system Pending CN106648991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611235476.9A CN106648991A (en) 2016-12-28 2016-12-28 Duplicated data deletion method in data recovery system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611235476.9A CN106648991A (en) 2016-12-28 2016-12-28 Duplicated data deletion method in data recovery system

Publications (1)

Publication Number Publication Date
CN106648991A 2017-05-10

Family

ID=58831878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611235476.9A Pending CN106648991A (en) 2016-12-28 2016-12-28 Duplicated data deletion method in data recovery system

Country Status (1)

Country Link
CN (1) CN106648991A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
CN102467571A (en) * 2010-11-17 2012-05-23 英业达股份有限公司 Data block partition method and addition method for data de-duplication
CN103345449A (en) * 2013-06-19 2013-10-09 暨南大学 Method and system for prefetching fingerprints oriented to data de-duplication technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUI QI et al.: "Minimum value sampling algorithm based on CDC", 2016 5th International Conference on Computer Science and Network Technology (ICCSNT) *
WEN XIA et al.: "FastCDC: a fast and efficient content-defined chunking approach for data deduplication", USENIX ATC '16: Proceedings of the 2016 USENIX Annual Technical Conference *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256003A (en) * 2017-12-29 2018-07-06 天津南大通用数据技术股份有限公司 A kind of method that union operation efficiencies are improved according to analysis Data duplication rate
US10599531B2 (en) 2018-02-16 2020-03-24 International Business Machines Corporation Using data set copies for recovery of a data set in event of a failure
CN109597574A (en) * 2018-11-27 2019-04-09 深圳市酷开网络科技有限公司 Distributed data storage method, server and readable storage medium storing program for executing
CN109597574B (en) * 2018-11-27 2021-09-24 深圳市酷开网络科技股份有限公司 Distributed data storage method, server and readable storage medium
CN110245245A (en) * 2019-06-20 2019-09-17 菏泽学院 A kind of device adding decorative element automatically in planar design
CN110427348A (en) * 2019-07-31 2019-11-08 关振宇 Wisdom security system data sharing method based on big data

Similar Documents

Publication Publication Date Title
CN106648991A (en) Duplicated data deletion method in data recovery system
CN104978151B (en) Data reconstruction method in the data de-duplication storage system perceived based on application
US9529912B2 (en) Metadata querying method and apparatus
CN103488709B (en) A kind of index establishing method and system, search method and system
US9715505B1 (en) Method and system for maintaining persistent live segment records for garbage collection
US9594674B1 (en) Method and system for garbage collection of data storage systems using live segment records
CN107391774B (en) The rubbish recovering method of log file system based on data de-duplication
US8190591B2 (en) Bit string searching apparatus, searching method, and program
US8225060B2 (en) Data de-duplication by predicting the locations of sub-blocks within the repository
CN109325032B (en) Index data storage and retrieval method, device and storage medium
US20120136842A1 (en) Partitioning method of data blocks
WO2014067063A1 (en) Duplicate data retrieval method and device
CN101751475B (en) Method for compressing section records and device therefor
CN107515931B (en) Repeated data detection method based on clustering
CN110569245A (en) Fingerprint index prefetching method based on reinforcement learning in data de-duplication system
CN104462388B (en) A kind of redundant data method for cleaning based on tandem type storage medium
CN109445703B (en) A kind of Delta compression storage assembly based on block grade data deduplication
CN116450656B (en) Data processing method, device, equipment and storage medium
WO2016091282A1 (en) Apparatus and method for de-duplication of data
CN113767378A (en) File system metadata deduplication
US20220100718A1 (en) Systems, methods and devices for eliminating duplicates and value redundancy in computer memories
CN116382588A (en) LSM-Tree storage engine read amplification problem optimization method based on learning index
CN114115734A (en) Data deduplication method, device, equipment and storage medium
CN114327252A (en) Data reduction in block-based storage systems using content-based block alignment
CN110413617B (en) Method for dynamically adjusting hash table group according to size of data volume

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170510)