CN106648991A - Duplicated data deletion method in data recovery system - Google Patents

Duplicated data deletion method in data recovery system

Info

Publication number
CN106648991A
CN106648991A
Authority
CN
China
Prior art keywords
section
sliding window
chunk
file
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611235476.9A
Other languages
Chinese (zh)
Inventor
祁晖
底晓强
李锦青
宋小龙
毕琳
蒋振刚
杨华民
从立钢
任维武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology
Priority to CN201611235476.9A
Publication of CN106648991A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • G06F11/1453Management of the data involved in backup or backup restore using de-duplication of the data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a duplicated data deletion (deduplication) method for a data disaster tolerance system, relates to the field of data storage, and solves the problems that existing deduplication methods are inefficient and occupy large amounts of memory. The method reduces the number of fingerprints in the fingerprint library by sampling, and uses the hash value of each sample as its fingerprint instead of computing fingerprints with a fingerprint generating algorithm such as MD5, so the fingerprint library is generated and searched efficiently. The method takes the sliding window with the minimum hash value as the sample; when part of a file is changed, the sample positions may not change, i.e., the hash values of the samples remain the minimum, so the method is insensitive to local updates of the file. Compared with the enhanced position-aware sampling method, the samples obtained by the method have higher similarity. Sample similarity and deduplication ratio were tested for different numbers of samples, and the results show that the method achieves both higher similarity and a higher deduplication ratio.

Description

Data deduplication method in a data disaster tolerance system
Technical field
The present invention relates to the field of data storage and to a data deduplication method, and in particular to a data deduplication method in a data disaster tolerance system.
Background technology
Data is the oil of the information age; in the big-data era especially, ensuring the security and availability of data is a highly important task. A data disaster tolerance system backs data up at a remote site so that important data is not lost when a disaster or a human error occurs. However, continually backed-up data is highly redundant, which not only wastes storage space but also degrades storage performance and increases storage cost. Data deduplication technology can eliminate the redundant data in a disaster tolerance system. The general deduplication procedure is: when a file is added to the storage system, the file is first divided into chunks; the fingerprint of each chunk is then computed with a fingerprint generating algorithm and compared with the chunk fingerprints already in the storage system; if a fingerprint does not yet exist, the corresponding data chunk is stored; otherwise only a pointer to the existing chunk needs to be recorded, saving storage space.
Traditional deduplication methods must store all chunk fingerprints. Although this yields a high deduplication ratio (the ratio of the amount of data deleted to the total amount of data before deduplication; for example, reducing a 10 GB backup to 3 GB stored gives a ratio of 70%), it occupies a large amount of memory and requires multiple disk accesses, so deduplication is inefficient. Similarity detection techniques sample the fingerprints to reduce the number of fingerprint lookups during deduplication and thereby improve its efficiency.
The currently representative sampling methods are sparse index sampling, minimum-value sampling, and enhanced position-aware sampling. Sparse index sampling selects as samples the chunks whose hash value has its first n bits equal to 0, i.e., on average one chunk in 2^n is selected (for example, with n = 6 roughly one chunk in 64). This saves memory, but when the file content changes the sample positions can shift, which affects the computed sample similarity (the similarity is the ratio of the intersection of two sample sets to the total number of samples; the larger this value, the more likely a high deduplication ratio is obtained). Minimum-value sampling selects the k chunks with the smallest hash values among all chunks of the file as samples; this improves sample similarity to some extent, but its computational overhead grows rapidly with file size, so it is unsuitable for large files. Enhanced position-aware sampling samples at fixed offsets from the head and the tail of the file, which reduces the impact of content changes on sample similarity to some extent, but its deduplication ratio is not high. A short sketch of the first two rules follows.
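For illustration, a minimal Python sketch of the two hash-based prior-art sampling rules just described; the 64-bit hash width and the function names are assumptions for the example, not anything fixed by the patent:

```python
import heapq

HASH_BITS = 64  # assumed width of the chunk hash values

def sparse_index_samples(chunk_hashes, n):
    """Sparse index sampling: keep chunks whose hash begins with n zero
    bits, so on average one chunk in 2**n becomes a sample."""
    return [h for h in chunk_hashes if h >> (HASH_BITS - n) == 0]

def min_value_samples(chunk_hashes, k):
    """Minimum-value sampling: the k smallest chunk hashes of the file."""
    return heapq.nsmallest(k, chunk_hashes)
```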
Summary of the invention
To solve the problems that existing data deduplication methods are inefficient and occupy a large amount of memory, the present invention provides a data deduplication method in a data disaster tolerance system.
In the data deduplication method in the data disaster tolerance system, sampling is introduced into the data chunking process so that sampling and chunking are completed in a single pass; the hash value of each sample serves as its fingerprint. The specific sampling procedure is as follows (a sketch of the index table follows the steps):
Step one: first divide the file into segments, the number of segments being equal to the number of samples;
Step two: establish a sampling index table whose keyword is the segment number and whose value is the sample hash value of the segment together with the set of boundary information of all chunks in the segment;
Step three: set the value of the variable min to infinity;
Step four: compute the hash value hv of the sliding window;
Step five: judge whether the hash value hv of step four is less than min and, if so, assign hv to the variable min; then execute step six;
Step six: judge whether the sliding window has reached a chunk boundary; if so, store the chunk boundary information into the chunk-boundary information set of the corresponding segment and execute step seven; if not, slide the window one byte to the right and return to step four;
Step seven: judge whether the sliding window has passed a segment boundary; if so, store the value of min as the sample hash value of the corresponding segment and execute step eight; if not, slide the window the distance of one window to the right and return to step four;
Step eight: judge whether the end of the file has been reached; if so, terminate; if not, slide the window the distance of one window to the right and return to step three;
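A minimal sketch of the sampling index table of steps one and two, assuming a plain Python dict; the layout and field names are illustrative, since the patent does not prescribe a concrete data structure:

```python
def new_sampling_index(n_samples):
    """One entry per segment, keyed by segment number (steps one and two).
    'sample_hash' receives the minimum sliding-window hash of the segment
    (step seven); 'boundaries' collects the (left, right) byte offsets of
    every chunk that ends inside the segment (step six)."""
    return {seg_no: {"sample_hash": None, "boundaries": []}
            for seg_no in range(n_samples)}
```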
The sampling index tables obtained by the above steps can be used to deduplicate two files; the specific procedure is as follows (a sketch follows the steps):
Step a: traverse the sampling index table of file A and obtain from it the sample hash value of segment Ai;
Step b: look up whether the hash value of step a is in the sampling index table of file B; if so, obtain the segment Bj of file B corresponding to that hash value and execute step c; if not, return to step a;
Step c: compute the fingerprint of each chunk in segment Ai and segment Bj with a fingerprint generating algorithm;
Step d: traverse the fingerprints in segment Ai;
Step e: judge whether a given fingerprint exists in segment Bj; if so, replace the chunk corresponding to the fingerprint with the address of the corresponding chunk in Bj; if not, return to step d;
Step f: judge whether all fingerprints in segment Ai have been checked; if so, return to step a; if not, return to step d.
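A hedged sketch of steps a through f, reusing the index-table layout above. The MD5 fingerprints follow the text; returning the duplicate chunks as a list of address pairs is an assumption, since the patent only says a duplicate chunk is replaced by the address of the matching chunk:

```python
import hashlib

def md5_fingerprint(data, lo, hi):
    """Step c: full fingerprint of the chunk data[lo:hi]."""
    return hashlib.md5(data[lo:hi]).hexdigest()

def deduplicate(index_a, data_a, index_b, data_b):
    """Steps a-f: for each segment Ai of file A whose sample hash also
    appears in file B's index, compare chunk fingerprints and report the
    chunks of A that can be replaced by an address into B."""
    duplicates = []
    for entry_a in index_a.values():                        # step a
        if entry_a["sample_hash"] is None:
            continue
        entry_b = next((e for e in index_b.values()         # step b
                        if e["sample_hash"] == entry_a["sample_hash"]), None)
        if entry_b is None:
            continue
        # step c: fingerprints are computed only for the matched segments
        fp_b = {md5_fingerprint(data_b, lo, hi): (lo, hi)
                for lo, hi in entry_b["boundaries"]}
        for lo, hi in entry_a["boundaries"]:                # steps d-f
            if (fp := md5_fingerprint(data_a, lo, hi)) in fp_b:  # step e
                duplicates.append(((lo, hi), fp_b[fp]))
    return duplicates
```

Note that this is where the method saves work: full fingerprints are computed only for segment pairs whose cheap sample hashes already match.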
Beneficial effects of the present invention: the method reduces the number of fingerprints in the fingerprint library by sampling, thereby saving storage space and improving fingerprint retrieval efficiency. At the same time, the invention uses the hash value of a sample as its fingerprint rather than computing it with a fingerprint generating algorithm such as MD5, so the fingerprint library is generated and searched more efficiently.
The invention selects the sliding window possessing the minimum hash value as the sample. When part of a file is changed, the sample positions may not change, i.e., the hash values of the samples remain the minimum, so this approach is insensitive to local updates of the file.
Compared with the enhanced position-aware sampling method, the samples obtained by the method have higher similarity. Sample similarity was measured experimentally with 8 and 16 samples, and the results show that the invention obtains higher similarity.
Description of the drawings
Fig. 1 is the flow chart of the sampling process of the data deduplication method in the data disaster tolerance system of the present invention;
Fig. 2 is the deduplication flow chart of the data deduplication method in the data disaster tolerance system of the present invention;
Fig. 3 compares the sample similarity of the data deduplication method in the data disaster tolerance system of the present invention with that of the existing enhanced position-aware sampling method when the number of samples is 8;
Fig. 4 compares the sample similarity of the data deduplication method in the data disaster tolerance system of the present invention with that of the existing enhanced position-aware sampling method when the number of samples is 16;
Fig. 5 compares the deduplication ratio of the data deduplication method in the data disaster tolerance system of the present invention with that of the existing enhanced position-aware sampling method when the number of samples is 8;
Fig. 6 compares the deduplication ratio of the data deduplication method in the data disaster tolerance system of the present invention with that of the existing enhanced position-aware sampling method when the number of samples is 16.
Specific embodiments
Embodiment one. This embodiment is described with reference to Fig. 1 to Fig. 6. In the data deduplication method in the data disaster tolerance system, sampling is introduced into the data chunking process so that sampling and chunking are completed in a single pass, and the hash value of each sample is used as its fingerprint.
The minimum-value sampling method based on variable-length chunking builds on CDC (content-defined chunking); the chunking method specifically comprises the following steps:
Step 1: set a sliding window of fixed size, D and r being predefined values;
Step 2: set the start of the file as the left boundary of the chunk, and align the left boundary of the sliding window with the left boundary of the chunk;
Step 3: if the right boundary of the sliding window reaches or exceeds the end of the file, set the end of the file as the right boundary of the chunk and terminate chunking;
Step 4: compute the hash value hv of the sliding window; if hv modulo D equals r, or the right boundary of the sliding window has reached the maximum right boundary of a chunk, set the right boundary of the sliding window as the right boundary of the chunk, slide the window the distance of one window to the right, and repeat from step 3; otherwise, slide the window one byte to the right and repeat from step 3.
In this embodiment, D and r may be any two integers greater than 0 as long as r is less than D; in the experiments D is 4096 and r is 13. In general, the value of D is less than or equal to the size of the sliding window, and r is a prime number. A sketch of this chunking follows.
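A self-contained sketch of the CDC chunking of steps 1 to 4. The window hash here is a plain byte sum standing in for the unspecified window hash, and the window size W and the maximum chunk size are assumed values; only D = 4096 and r = 13 come from the text:

```python
W = 48                  # sliding-window size in bytes (assumed)
D, r = 4096, 13         # values used in the experiments described above
MAX_CHUNK = 65536       # "maximum right boundary" of a chunk (assumed)

def window_hash(data, pos):
    """Stand-in for the unspecified sliding-window hash."""
    return sum(data[pos:pos + W])

def cdc_chunks(data):
    """Steps 1-4: return the (left, right) boundary of every chunk."""
    chunks, left, pos = [], 0, 0
    while pos + W < len(data):
        hv = window_hash(data, pos)                  # step 4
        if hv % D == r or (pos + W) - left >= MAX_CHUNK:
            chunks.append((left, pos + W))           # window's right edge ends the chunk
            left = pos + W
            pos += W                                 # slide a full window width
        else:
            pos += 1                                 # slide one byte
    chunks.append((left, len(data)))                 # step 3: EOF closes the last chunk
    return chunks
```

Because a boundary is declared whenever hv % D == r, chunk sizes average roughly D bytes for a well-distributed hash, which is why D controls the chunking granularity.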
This embodiment performs sampling within the CDC chunking process. With reference to Fig. 1, the detailed process is as follows (a sketch building on the previous ones follows the steps):
Step 1: first divide the file into segments, the number of segments being equal to the number of samples;
Step 2: establish a sampling index table whose keyword is the segment number and whose value is the sample hash value of the segment together with the set of boundary information of all chunks in the segment;
Step 3: set the value of the variable min to infinity;
Step 4: compute the hash value of the sliding window and assign it to the variable hv; compare hv with min and, if hv is less than min, assign hv to min;
Step 5: if the sliding window reaches a chunk boundary, store the chunk boundary information into the chunk-boundary information set of the corresponding segment;
Step 6: if the sliding window passes a segment boundary, store the value of min as the sample hash value of the corresponding segment, then execute step 3 again, until the sliding window reaches the end of the file.
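Building on the two sketches above (reusing W, D, r, MAX_CHUNK, window_hash and new_sampling_index), a hedged sketch of steps 1 to 6: the minimum window hash of each segment is tracked inside the same loop that finds chunk boundaries, so sampling and chunking finish in a single pass over the file. Where the text leaves details open (e.g. assigning a chunk to the segment its boundary falls in), the choices below are assumptions:

```python
def build_sampling_index(data, n_samples=8):
    """Steps 1-6: chunk the file and sample each segment in one pass."""
    seg_len = max(W, len(data) // n_samples)          # step 1: segments
    index = new_sampling_index(n_samples)             # step 2: index table
    left, pos = 0, 0
    cur_min = float("inf")                            # step 3
    while pos + W < len(data):
        hv = window_hash(data, pos)                   # step 4: window hash
        if hv < cur_min:
            cur_min = hv
        if hv % D == r or (pos + W) - left >= MAX_CHUNK:
            seg = min(pos // seg_len, n_samples - 1)
            index[seg]["boundaries"].append((left, pos + W))  # step 5
            left = pos + W
            if pos + W >= (seg + 1) * seg_len:        # step 6: segment done
                index[seg]["sample_hash"] = cur_min
                cur_min = float("inf")                # restart at step 3
            pos += W                                  # slide a full window
        else:
            pos += 1                                  # slide one byte
    seg = min(pos // seg_len, n_samples - 1)          # EOF: close the last chunk
    index[seg]["boundaries"].append((left, len(data)))
    if index[seg]["sample_hash"] is None:
        index[seg]["sample_hash"] = cur_min
    return index
```

Feeding two such tables to the deduplicate sketch above, e.g. deduplicate(build_sampling_index(a), a, build_sampling_index(b), b), reproduces the end-to-end flow of Fig. 1 and Fig. 2.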
This embodiment is described with reference to Fig. 2. Deduplication is carried out after embodiment one and specifically comprises the following steps (the two files to be deduplicated are denoted A and B):
Step 1: traverse the sampling index table of file A and obtain from it the sample hash value of segment Ai;
Step 2: look up whether the hash value of step 1 is in the sampling index table of file B; if it is, obtain the segment Bj of file B corresponding to that value and execute step 3; otherwise continue with step 1;
Step 3: compute the fingerprint of each chunk in segment Ai and segment Bj with a fingerprint generating algorithm such as MD5;
Step 4: traverse each fingerprint in segment Ai; if the fingerprint exists in segment Bj, replace the chunk corresponding to the fingerprint with the address of the corresponding chunk in Bj;
Step 5: after all fingerprints in segment Ai have been checked, continue with step 1.
To improve sampling speed, this embodiment does not compute chunk fingerprints with an algorithm such as MD5 during CDC chunking; fingerprint computation is postponed to the deduplication stage. The enhanced position-aware sampling method, by contrast, must perform CDC chunking and compute the chunk fingerprints in the deduplication stage. Therefore, when deduplicating highly similar files, such as the periodic full database backups of a disaster tolerance system, the total deduplication time of the present invention is roughly on par with that of the enhanced position-aware sampling method.
This embodiment is described with reference to Fig. 3 to Fig. 6; the method described in this embodiment obtains a higher deduplication ratio. The deduplication ratio was tested experimentally with 8 and 16 samples, and the results show that the invention obtains a higher deduplication ratio: it improved by 21.5% and 20.3% when N is 8 and 16, respectively.

Claims (2)

1. A data deduplication method in a data disaster tolerance system, in which sampling is introduced into the data chunking process so that sampling and chunking are completed in a single pass, and the hash value of each sample is used as its fingerprint, characterized in that the specific sampling procedure is:
Step one: first divide the file into segments, the number of segments being equal to the number of samples;
Step two: establish a sampling index table whose keyword is the segment number and whose value is the sample hash value of the segment together with the set of boundary information of all chunks in the segment;
Step three: set the value of the variable min to infinity;
Step four: compute the hash value hv of the sliding window;
Step five: judge whether the hash value hv of step four is less than min and, if so, assign hv to the variable min; then execute step six;
Step six: judge whether the sliding window has reached a chunk boundary; if so, store the chunk boundary information into the chunk-boundary information set of the corresponding segment and execute step seven; if not, slide the window one byte to the right and return to step four;
Step seven: judge whether the sliding window has passed a segment boundary; if so, store the value of min as the sample hash value of the corresponding segment and execute step eight; if not, slide the window the distance of one window to the right and return to step four;
Step eight: judge whether the end of the file has been reached; if so, terminate; if not, slide the window the distance of one window to the right and return to step three;
The sampling index tables obtained by the above steps can be used to deduplicate two files; the specific procedure is:
Step a: traverse the sampling index table of file A and obtain from it the sample hash value of segment Ai;
Step b: look up whether the hash value of step a is in the sampling index table of file B; if so, obtain the segment Bj of file B corresponding to that hash value and execute step c; if not, return to step a;
Step c: compute the fingerprint of each chunk in segment Ai and segment Bj with a fingerprint generating algorithm;
Step d: traverse the fingerprints in segment Ai;
Step e: judge whether a given fingerprint exists in segment Bj; if so, replace the chunk corresponding to the fingerprint with the address of the corresponding chunk in Bj; if not, return to step d;
Step f: judge whether all fingerprints in segment Ai have been checked; if so, return to step a; if not, return to step d.
2. The data deduplication method in a data disaster tolerance system according to claim 1, characterized in that a CDC chunking method is adopted, the specific procedure being:
Step 1: set a sliding window of fixed size, D and r being predefined values; D and r are integers greater than 0;
Step 2: set the start of the file as the left boundary of the chunk, and align the left boundary of the sliding window with the left boundary of the chunk;
Step 3: if the right boundary of the sliding window reaches or exceeds the end of the file, set the end of the file as the right boundary of the chunk and terminate chunking;
Step 4: compute the hash value hv of the sliding window; if hv modulo D equals r, or the right boundary of the sliding window has reached the maximum right boundary of a chunk, set the right boundary of the sliding window as the right boundary of the chunk, slide the window the distance of one window to the right, and repeat from step 3; otherwise, slide the window one byte to the right and repeat from step 3.
CN201611235476.9A 2016-12-28 2016-12-28 Duplicated data deletion method in data recovery system Pending CN106648991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611235476.9A CN106648991A (en) 2016-12-28 2016-12-28 Duplicated data deletion method in data recovery system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611235476.9A CN106648991A (en) 2016-12-28 2016-12-28 Duplicated data deletion method in data recovery system

Publications (1)

Publication Number Publication Date
CN106648991A 2017-05-10

Family

ID=58831878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611235476.9A Pending CN106648991A (en) 2016-12-28 2016-12-28 Duplicated data deletion method in data recovery system

Country Status (1)

Country Link
CN (1) CN106648991A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
CN102467571A (en) * 2010-11-17 2012-05-23 英业达股份有限公司 Data block partition method and addition method for data de-duplication
CN103345449A (en) * 2013-06-19 2013-10-09 暨南大学 Method and system for prefetching fingerprints oriented to data de-duplication technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUI QI et al.: "Minimum value sampling algorithm based on CDC", 2016 5th International Conference on Computer Science and Network Technology (ICCSNT) *
WEN XIA et al.: "FastCDC: a fast and efficient content-defined chunking approach for data deduplication", USENIX ATC '16: Proceedings of the 2016 USENIX Annual Technical Conference *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256003A (en) * 2017-12-29 2018-07-06 天津南大通用数据技术股份有限公司 A kind of method that union operation efficiencies are improved according to analysis Data duplication rate
US10599531B2 (en) 2018-02-16 2020-03-24 International Business Machines Corporation Using data set copies for recovery of a data set in event of a failure
CN109597574A (en) * 2018-11-27 2019-04-09 深圳市酷开网络科技有限公司 Distributed data storage method, server and readable storage medium storing program for executing
CN109597574B (en) * 2018-11-27 2021-09-24 深圳市酷开网络科技股份有限公司 Distributed data storage method, server and readable storage medium
CN110245245A (en) * 2019-06-20 2019-09-17 菏泽学院 A kind of device adding decorative element automatically in planar design
CN110427348A (en) * 2019-07-31 2019-11-08 关振宇 Wisdom security system data sharing method based on big data

Similar Documents

Publication Publication Date Title
CN106648991A (en) Duplicated data deletion method in data recovery system
CN104978151B (en) Data reconstruction method in the data de-duplication storage system perceived based on application
US9529912B2 (en) Metadata querying method and apparatus
CN103488709B (en) A kind of index establishing method and system, search method and system
US9715505B1 (en) Method and system for maintaining persistent live segment records for garbage collection
US9594674B1 (en) Method and system for garbage collection of data storage systems using live segment records
CN107391774B (en) The rubbish recovering method of log file system based on data de-duplication
US8190591B2 (en) Bit string searching apparatus, searching method, and program
US8225060B2 (en) Data de-duplication by predicting the locations of sub-blocks within the repository
CN109325032B (en) Index data storage and retrieval method, device and storage medium
US20120136842A1 (en) Partitioning method of data blocks
WO2014067063A1 (en) Duplicate data retrieval method and device
CN101751475B (en) Method for compressing section records and device therefor
CN107515931B (en) Repeated data detection method based on clustering
CN110569245A (en) Fingerprint index prefetching method based on reinforcement learning in data de-duplication system
CN104462388B (en) A kind of redundant data method for cleaning based on tandem type storage medium
CN109445703B (en) A kind of Delta compression storage assembly based on block grade data deduplication
CN116450656B (en) Data processing method, device, equipment and storage medium
WO2016091282A1 (en) Apparatus and method for de-duplication of data
CN113767378A (en) File system metadata deduplication
US20220100718A1 (en) Systems, methods and devices for eliminating duplicates and value redundancy in computer memories
CN116382588A (en) LSM-Tree storage engine read amplification problem optimization method based on learning index
CN114115734A (en) Data deduplication method, device, equipment and storage medium
CN114327252A (en) Data reduction in block-based storage systems using content-based block alignment
CN110413617B (en) Method for dynamically adjusting hash table group according to size of data volume

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170510)