WO2013075668A1 - Duplicate data deletion method and device - Google Patents


Info

Publication number
WO2013075668A1
Authority
WO
WIPO (PCT)
Prior art keywords
fingerprint
sampling
file
group
stored
Prior art date
Application number
PCT/CN2012/085278
Other languages
French (fr)
Chinese (zh)
Inventor
付旭东
徐君
Original Assignee
华为技术有限公司
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2013075668A1 publication Critical patent/WO2013075668A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments

Definitions

  • Deduplication is a data reduction technique commonly used in disk-based backup systems to reduce the storage capacity used in storage systems.
  • Deduplication technology is aimed at scenarios with large data volumes.
  • The industry's deduplication techniques mainly include block detection, similarity detection, and delta coding. Of these, similarity detection and delta coding are data compression methods that detect only similar files, so their deduplication rate is low.
  • The deduplication rate is an important indicator of the effectiveness of deduplication; it expresses the rate at which duplicate data is removed.
  • FIG. 1 is a schematic diagram of a process of the data deduplication technology in the prior art.
  • a huge block data index table is established in the memory to maintain the index of the block data.
  • The data object is divided into blocks, the fingerprint of each block is calculated, and the fingerprint of each block is stored in the block data index table (that is, the fingerprint library). FIG. 2 is a schematic diagram of the structure of the fingerprint library in the prior-art deduplication technology.
  • When data is stored, the block data index table is queried first. Blocks whose fingerprints are already in the table are not stored again; only new blocks, whose fingerprints are not found in the block data index table, are stored. Storing blocks with duplicate content is thereby avoided, which is equivalent to deleting the blocks with duplicate content.
  • The prior art has at least the following drawback: when massive amounts of data are stored, deduplication generates a large number of blocks, and each block fingerprint must be compared one by one against the large number of fingerprints in the fingerprint library. The computation and memory requirements are large, which makes the deduplication process inefficient.
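  • The prior-art flow described above can be sketched as follows. This is a minimal illustration, not the method of any particular product: the function names, the use of SHA-1 alone as the fingerprint, and the in-memory `set` and `dict` are all assumptions. It shows the property the drawback refers to: a single index must hold the fingerprint of every unique block in the system.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    # Assumption: SHA-1 stands in for whatever fingerprint the system uses.
    return hashlib.sha1(block).hexdigest()

def store_with_global_index(blocks, index, storage):
    """Store only blocks whose fingerprint is not yet in the global index.

    index   -- set holding the fingerprint of every block stored so far
    storage -- dict mapping fingerprint -> block data
    Returns the number of blocks actually written.
    """
    written = 0
    for block in blocks:
        fp = fingerprint(block)
        if fp in index:
            continue            # duplicate content: do not store again
        index.add(fp)           # the index grows with every unique block
        storage[fp] = block
        written += 1
    return written
```

  • Every file stored through this function consults and enlarges the same `index`, which is what makes the memory and lookup cost grow with the total data volume.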
  • the present invention provides a deduplication method and apparatus, which solves the problem that the amount of calculation and the consumption of resources required for deduplication in the prior art are large, resulting in low deduplication performance.
  • the present invention provides a method for deduplication, including:
  • the packet sampling library is composed of at least one sample group
  • the fingerprint library is composed of at least one fingerprint group
  • each sample group in the group sample library corresponds to each fingerprint group in the fingerprint library.
  • the similar grouping is a sampling group in the packet sampling library that matches the sampling fingerprint in the fingerprint sampling table of the file to be stored.
  • the present invention provides a data deduplication apparatus, including:
  • a blocking module configured to perform block processing on the file to be stored, and calculate a fingerprint of each block in the block processing result
  • a sampling module configured to perform sampling processing on the fingerprints of the blocks, and generate a fingerprint sampling table of the file to be stored according to the extracted fingerprints;
  • a grouping module configured to determine, according to the fingerprint sampling table and the group sampling database, a similar group to which the file to be stored belongs in the grouping sample library;
  • a deduplication module configured to perform deduplication on the file to be stored according to the fingerprint data in the fingerprint group corresponding to the similar grouping in the fingerprint database;
  • the packet sampling library is composed of at least one sample group
  • the fingerprint library is composed of at least one fingerprint group, each sample group in the group sampling library corresponds to a fingerprint group in the fingerprint library, and the similar group is a sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
  • The deduplication method and device of the present invention perform block processing on a file to be stored, calculate a fingerprint of each block, sample the fingerprints of the blocks, determine, according to the generated fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the block fingerprints are further sampled, the similar group is first determined by similarity analysis, and deduplication is then performed only within the corresponding fingerprint group, the amount of query computation during deduplication is reduced. This solves the prior-art problem that massive block data introduces a huge amount of computation and resource consumption during deduplication, and improves deduplication performance.
  • FIG. 1 is a schematic diagram of a process of a data deduplication technology in the prior art
  • FIG. 2 is a schematic structural diagram of a fingerprint library of a data deduplication technology in the prior art
  • FIG. 3 is a flowchart of Embodiment 1 of the data deduplication method of the present invention
  • FIG. 4 is a flowchart of Embodiment 2 of the data deduplication method of the present invention
  • FIG. 5 is a schematic structural diagram of a packet sampling library in the second embodiment of the data deduplication method of the present invention
  • FIG. 6 is a structural diagram of Embodiment 1 of the data deduplication device of the present invention
  • FIG. 7 is a structural diagram of Embodiment 2 of the data deduplication apparatus of the present invention.
  • the technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention.
  • the embodiments are a part of the embodiments of the invention, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.
  • FIG. 3 is a flowchart of Embodiment 1 of the data deduplication method of the present invention. As shown in FIG. 3, the embodiment provides a method for deleting data, which may specifically include the following steps:
  • Step 301 Perform block processing on the file to be stored, and calculate the fingerprint of each block in the block processing result.
  • The same deduplication method is applied to the storage of every file; a file that has not yet been stored is referred to as the file to be stored.
  • The file to be stored is divided into blocks. The block processing can use a blocking technique from the prior art, for example a variable-length blocking algorithm.
  • The fingerprint calculation can likewise use a prior-art method. For example, a double hash algorithm combining SHA-1 and MD5 can be used to calculate the fingerprint of each block.
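  • Step 301 can be sketched as follows. The text names only "a variable-length blocking algorithm" and a SHA-1 and MD5 double hash, so everything concrete here is an assumption made for illustration: the toy shift-and-add boundary hash, the `mask`, `min_size` and `max_size` parameters, and the choice to concatenate the two digests.

```python
import hashlib

def chunk_variable(data: bytes, mask: int = 0x3F,
                   min_size: int = 16, max_size: int = 256):
    """Toy content-defined chunking (hypothetical parameters).

    A running hash is updated per byte; a block ends when the low bits of
    the hash selected by `mask` are all zero, or when max_size is reached.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])   # trailing partial block
    return chunks

def block_fingerprint(block: bytes) -> bytes:
    # Assumption: the double hash is the SHA-1 digest followed by the
    # MD5 digest, giving a 20 + 16 = 36 byte fingerprint.
    return hashlib.sha1(block).digest() + hashlib.md5(block).digest()
```

  • A production system would normally use an established rolling hash for the block boundaries; the shift-and-add hash above is only meant to show the shape of the computation.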
  • Step 302 Perform sampling processing on the fingerprints of the blocks, and generate a fingerprint sampling table of the file to be stored according to the extracted fingerprints.
  • The fingerprints are sampled. The basic requirement of the sampling is that every fingerprint in the sampling result falls within the set of block fingerprints of the file to be deduplicated, and that the number of fingerprints in the sampling result is no greater than the number of block fingerprints of that file.
  • A sampling factor can be used to sample the block fingerprints. The sampling factor is a sampling feature value that is combined with each block fingerprint of the file in a logical operation; the sampling process selects a subset of the fingerprints according to the sampling rule and the result of the logical operation. Different sampling factors are selected for different files to be stored, so the sampling factor can reflect the characteristics of the file to be stored. The fingerprints selected in this way form the fingerprint sampling table of the file to be stored. The sampling factor may be determined from the file size and the number of blocks of the file to be stored, or a predetermined fixed sampling factor may be used. The fingerprints retained in the fingerprint sampling table are those that can represent the characteristics of the file to be stored, which reduces the number of fingerprints that must subsequently be stored.
  • The sampling in this step may also be independent of a sampling factor. Specifically, the last byte of each block fingerprint may be taken directly as the extracted fingerprint; or the fingerprints of blocks at fixed positions may be extracted, for example the blocks at positions that are integer multiples of 9; or fingerprints may be extracted at a predetermined sampling ratio, for example by randomly extracting 5% of the blocks.
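  • Two of the factor-independent sampling options mentioned above, fixed positions and a fixed ratio, might look like this. The function names, the 1-based position counting, the `seed` parameter, and the choice to keep at least one fingerprint when sampling by ratio are assumptions, not details given in the text.

```python
import random

def sample_fixed_position(fingerprints, step=9):
    # Keep the fingerprints of blocks whose 1-based position is an
    # integer multiple of `step` (9 in the example from the text).
    return [fp for pos, fp in enumerate(fingerprints, start=1)
            if pos % step == 0]

def sample_by_ratio(fingerprints, ratio=0.05, seed=None):
    # Randomly keep about `ratio` of the fingerprints (5% in the text).
    # Assumption: at least one fingerprint is kept for a non-empty file.
    fingerprints = list(fingerprints)
    if not fingerprints:
        return []
    count = max(1, int(len(fingerprints) * ratio))
    return random.Random(seed).sample(fingerprints, count)
```

  • Both functions satisfy the basic requirement of the sampling: the result is a subset of the block fingerprints and is never larger than the input.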
  • Step 303 Determine, according to the fingerprint sampling table and the group sampling database, a similar group to which the file to be stored belongs in the group sampling database.
  • Based on the fingerprint sampling table and the current group sampling library saved in the storage system, this step determines the similar group to which the file to be stored belongs in the current group sampling library.
  • The group sampling library is composed of at least one sample group, each sample group including the sampled fingerprints of one or more stored files that are similar to one another.
  • The similar group is a sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
  • a fingerprint library is also provided in the embodiment, and the fingerprint library is composed of at least one fingerprint group.
  • Each fingerprint group corresponds to a sample group in the group sampling library, and the fingerprints stored in a fingerprint group are the fingerprints of the stored files after deduplication. That is, each sample group of the group sampling library holds the sampled fingerprints of stored files that are similar to one another, and each fingerprint group of the fingerprint library holds all the fingerprints of those stored files after deduplication. If the file to be stored processed in this step is the first file, the group sampling library is empty at this point.
  • Specifically, this step may match each fingerprint in the fingerprint sampling table against the current group sampling library, and thereby obtain from the group sampling library the sample group that matches the sampled fingerprints of the file to be stored, that is, the similar group to which the file to be stored belongs.
  • Step 304 Perform deduplication on the file to be stored according to the fingerprint data in the fingerprint group corresponding to the similar group in the fingerprint library.
  • After the similar group is determined, the corresponding fingerprint group in the fingerprint library is determined from it, and the file to be stored is deduplicated according to the fingerprint data in that fingerprint group.
  • The specific deduplication method may be similar to the prior art: the calculated fingerprint of each block of the file to be stored is matched against the fingerprints stored in the fingerprint group corresponding to the determined similar group.
  • This embodiment narrows the range of fingerprints queried during deduplication from the whole fingerprint library to one fingerprint group in the fingerprint library, which greatly reduces the computation required for query matching.
  • This embodiment provides a deduplication method that performs block processing on a file to be stored, calculates a fingerprint of each block, samples the fingerprints of the blocks, determines, according to the generated fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and deduplicates the file to be stored according to the fingerprint data in the fingerprint group corresponding to the similar group in the fingerprint library. Because the block fingerprints are further sampled, the similar group is first determined by similarity analysis, and deduplication is then performed only within the corresponding fingerprint group, the amount of query computation during deduplication is reduced. This solves the prior-art problem that massive block data introduces a huge amount of computation and resource consumption during deduplication, and improves deduplication performance.
  • FIG. 4 is a flowchart of Embodiment 2 of the data deduplication method of the present invention. As shown in FIG. 4, this embodiment provides a deduplication method, which may specifically include the following steps:
  • Step 401 Perform block processing on the file to be stored, and calculate the fingerprint of each block in the block processing result. This step may be similar to the foregoing step 301, and details are not described here again.
  • Step 402 Determine a sampling factor according to a file feature of the file to be stored.
  • This step determines the sampling factor used for sampling. Specifically, the sampling factor is determined according to the file features of the file to be stored, which may be the file size, the number of blocks, and so on; different files to be stored may therefore have different sampling factors. For example, when the number of blocks of the file to be stored is greater than 1 million, the sampling factor is determined to be 0xFFF; when the number of blocks is less than 1 million and greater than 100,000, the sampling factor is 0x3FF; when the number of blocks is less than 100,000 and greater than 10,000, the sampling factor is 0x2FF; and when the number of blocks is less than 10,000, the sampling factor is 0x3F.
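  • The example thresholds in step 402 translate directly into a lookup. The function name is hypothetical, and the text does not say which factor applies when the block count is exactly 1 million, 100,000 or 10,000; this sketch assumes such boundary values fall into the lower tier.

```python
def choose_sampling_factor(num_blocks: int) -> int:
    # Thresholds and factors are the example values from step 402.
    # Assumption: a count exactly on a boundary uses the lower tier.
    if num_blocks > 1_000_000:
        return 0xFFF
    if num_blocks > 100_000:
        return 0x3FF
    if num_blocks > 10_000:
        return 0x2FF
    return 0x3F
```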
  • Step 403 Sample the fingerprints of all the blocks of the file to be stored by using the sampling factor according to the set sampling condition.
  • After the sampling factor is determined, sampling is performed with it according to the set sampling condition. For example, the fingerprint of each block can be ANDed with the sampling factor and the result checked: if the result is 0, the set sampling condition is met.
  • Step 404 Add the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.
  • A sampling result is obtained for each block, and the fingerprint of each block whose sampling result meets the sampling condition is added to the fingerprint sampling table of the file to be stored.
  • For example, the fingerprints of the blocks whose AND with the sampling factor is 0 are added to the fingerprint sampling table, and the rest remain in the database in which the fingerprints were originally stored; this forms the fingerprint sampling table of the file to be stored.
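  • Steps 403 and 404 can be sketched as a single filter. This assumes the logical operation is a bitwise AND and that a fingerprint is interpreted as a big-endian integer, so the factor tests its lowest bits; the function name and the byte order are illustrative choices, not details from the text.

```python
def build_sampling_table(fingerprints, factor: int):
    """Keep the fingerprints whose bitwise AND with `factor` is 0.

    fingerprints -- iterable of bytes objects (block fingerprints)
    Assumption: each fingerprint is read as a big-endian integer.
    """
    return [fp for fp in fingerprints
            if (int.from_bytes(fp, "big") & factor) == 0]
```

  • With a factor such as 0x3F, a fingerprint passes only if its six lowest bits are all zero, so roughly one fingerprint in 64 is kept when the fingerprints are uniformly distributed.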
  • Step 405 Determine whether the fingerprint sampling table of the file to be stored is empty. If yes, execute step 406; otherwise, perform step 407.
  • Once the fingerprint sampling table has been generated, the file to be stored is matched and assigned to a group according to the fingerprint sampling table and the current group sampling library. This step first determines whether the fingerprint sampling table of the file to be stored is empty, that is, whether the sampling process yielded no fingerprint that satisfies the sampling condition. If it is empty, step 406 is performed; otherwise, step 407 is performed.
  • Step 406 Determine a similar group to which the file to be stored belongs in the group sampling database as a preset group in the group sampling database, and perform step 411.
  • When the fingerprint sampling table of the file to be stored is empty, the sampling process produced no result that satisfies the sampling condition, that is, the file to be stored contains no block that satisfies the sampling condition. The similar group of the file to be stored in the current group sampling library is then determined to be the preset group of the current group sampling library.
  • The similarity analysis of this embodiment ends here, and step 411 is executed next: the file to be stored is deduplicated within the fingerprint group that corresponds to the preset group in the fingerprint library.
  • The preset group is a group set in advance in this embodiment and has no special meaning. The preset group may be empty; it corresponds to a specific fingerprint group in the fingerprint library, and that fingerprint group holds the fingerprints of the files to be stored whose fingerprint sampling tables are empty.
  • A fingerprint sampling table that is empty after sampling is a special case; its handling is described here only so that such a case does not interrupt the overall process.
  • Step 407 Perform matching processing on each fingerprint in the fingerprint sampling table and the group sampling database.
  • When the fingerprint sampling table of the file to be stored is not empty, the sampling process yielded fingerprints that satisfy the sampling condition, and the fingerprints stored in the fingerprint sampling table are matched: specifically, each fingerprint in the fingerprint sampling table is matched against the current group sampling library.
  • The fingerprints in the group sampling library are stored in groups, and the fingerprints in each group are the sampled fingerprints of one or more files that have a certain similarity.
  • Taking each group in the current group sampling library as a unit, the fingerprints in the fingerprint sampling table are compared one by one with the fingerprints in that group, which gives a matching result for each group, for example a similarity.
  • The similarity may be the ratio of the number of fingerprints in the fingerprint sampling table that are the same as or similar to fingerprints in the corresponding group to the total number of fingerprints in the fingerprint sampling table.
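  • The similarity ratio described above can be computed as follows. This sketch counts only identical fingerprints; the text also allows "similar" fingerprints but does not define that notion, so it is left out. The function name and the return value of 0.0 for an empty table are assumptions.

```python
def similarity(sampling_table, sample_group) -> float:
    """Fraction of the table's fingerprints that also appear in the group.

    Assumption: only exact matches count; an empty table yields 0.0.
    """
    table = list(sampling_table)
    if not table:
        return 0.0
    group = set(sample_group)
    hits = sum(1 for fp in table if fp in group)
    return hits / len(table)
```

  • For a table of four fingerprints of which two occur in the group, the function returns 0.5.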
  • Step 408 Determine whether the capacity of the packet sampling library has reached the capacity upper limit. If yes, execute step 409; otherwise, perform step 410.
  • After the matching results are obtained, this step determines whether the capacity of the current group sampling library has reached its upper limit, that is, whether the group sampling library is full; FIG. 5 shows the structure of the group sampling library in Embodiment 2 of the deduplication method of the present invention. If it is full, go to step 409; otherwise, go to step 410.
  • Step 409 Determine that the similar group to which the file to be stored belongs in the group sampling library is the group with the highest similarity to the fingerprints in the fingerprint sampling table.
  • When the group sampling library is full, the group in the current group sampling library with the highest similarity to the fingerprints in the fingerprint sampling table is obtained from the matching results, the similar group of the file to be stored in the current group sampling library is determined to be that group, and step 411 is performed.
  • Step 410 Determine, according to the fingerprint similarity between the fingerprints in the fingerprint sampling table and each sample group in the group sampling library, the similar group to which the file to be stored belongs in the group sampling library.
  • When the group sampling library is not full, the similar group of the file to be stored in the current group sampling library is determined according to the fingerprint similarity between the fingerprints in the fingerprint sampling table and each sample group in the current group sampling library. Specifically, when the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sample group in the current group sampling library is greater than or equal to a preset similarity threshold, the file to be stored is considered to belong to that sample group; the similar group of the file to be stored in the current group sampling library is directly determined to be that sample group, and step 411 is performed.
  • During fingerprint matching, as soon as the first sample group that satisfies the above similarity condition is found, it is taken as the similar group selected by the similarity analysis, and no further matching against the other sample groups is performed. This reduces the computational complexity of the similarity analysis algorithm and improves its performance.
  • When the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the current group sampling library is less than the preset similarity threshold, the file to be stored does not belong to any group in the current group sampling library. In this case, a new group is created in the current group sampling library, the similar group of the file to be stored in the current group sampling library is determined to be the newly created group, the fingerprints in the fingerprint sampling table of the file to be stored are saved in the newly created group, and step 411 is performed.
  • Because the first group that meets the threshold is selected by the similarity analysis and the subsequent groups need not be matched, this embodiment significantly reduces the amount of computation of the similarity analysis algorithm.
  • In addition, when the file to be stored is the first file to be stored, determining the similar group to which the file to be stored belongs in the current group sampling library according to the fingerprint sampling table and the current group sampling library may specifically be: establishing a new group in the current group sampling library, determining that the similar group of the file to be stored in the current group sampling library is the newly created group, and saving the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
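  • The decision flow of steps 405 to 410 can be put together in one function. This is a sketch under several assumptions: the library is a dict from a group id to a set of sampled fingerprints, the capacity limit is counted in groups (`max_groups`, assumed to be at least 1), the group ids and the `PRESET_GROUP` constant are hypothetical, and similarity counts exact matches only.

```python
PRESET_GROUP = "preset"   # hypothetical id of the preset group (step 406)

def similarity(table, group) -> float:
    # Share of the table's fingerprints found in the group (exact match).
    return sum(1 for fp in table if fp in group) / len(table)

def find_similar_group(table, library, threshold=0.5, max_groups=1000):
    """Return the id of the similar group for a fingerprint sampling table.

    library -- dict: group id -> set of sampled fingerprints
    """
    if not table:                                   # steps 405 and 406
        return PRESET_GROUP
    if len(library) >= max_groups:                  # steps 408 and 409
        return max(library, key=lambda gid: similarity(table, library[gid]))
    for gid, group in library.items():              # step 410
        if similarity(table, group) >= threshold:
            return gid                              # stop at first match
    new_id = f"group-{len(library)}"                # no match: new group
    library[new_id] = set(table)
    return new_id
```

  • The first file falls through to the last branch because the library is empty, which matches the first-file case described above. Stopping at the first group that reaches the threshold trades the best possible match for fewer comparisons.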
  • Step 411 Perform deduplication on the file to be stored according to the fingerprint data in the fingerprint group corresponding to the similar group in the fingerprint database.
  • After the similar group of the file to be stored is determined, the file to be stored is deduplicated according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library.
  • The specific deduplication method may be similar to the prior art: the calculated fingerprint of each block of the file to be stored is matched against the fingerprints saved in the fingerprint group corresponding to the similar group. If the fingerprint of a block is already saved in the fingerprint group corresponding to the similar group, the data of that block is deleted; if the fingerprint group contains no fingerprint that is the same as or similar to that of a block, the data of that block is stored. In this embodiment, the range of fingerprints queried during deduplication is thus reduced from the entire fingerprint library to one fingerprint group in the fingerprint library, which greatly reduces the computation required for query matching.
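  • Step 411 differs from the prior-art lookup only in its scope: the query is made against one fingerprint group rather than the whole library. A minimal sketch, assuming the fingerprint library is a dict from a group id to a set of fingerprints and that only identical fingerprints count as duplicates (all names here are hypothetical):

```python
def dedup_in_group(blocks, group_id, fingerprint_library, storage):
    """Deduplicate blocks against a single fingerprint group.

    blocks              -- list of (fingerprint, data) pairs for one file
    fingerprint_library -- dict: group id -> set of stored fingerprints
    storage             -- dict: fingerprint -> block data
    Returns the number of blocks actually stored.
    """
    group = fingerprint_library.setdefault(group_id, set())
    stored = 0
    for fp, data in blocks:
        if fp in group:
            continue          # already stored for this group: drop block
        group.add(fp)         # new block: record fingerprint, keep data
        storage[fp] = data
        stored += 1
    return stored
```

  • Only the fingerprints of `group_id` are consulted, so the cost of a lookup depends on the size of that group and not on the total number of fingerprints in the library.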
  • This embodiment provides a deduplication method that performs block processing on a file to be stored, calculates a fingerprint of each block, samples the fingerprints by using a sampling factor, determines, according to the generated fingerprint sampling table and the current group sampling library, the similar group to which the file to be stored belongs in the current group sampling library, and deduplicates the file to be stored according to the fingerprint data in the fingerprint group corresponding to the similar group in the fingerprint library. Because the block fingerprints are further sampled, the similar group is first determined by similarity analysis, and deduplication is then performed only within the fingerprint group corresponding to the similar group, the amount of query computation during deduplication is reduced.
  • This solves the prior-art problem that massive block data introduces a huge amount of computation and resource consumption during deduplication, and improves deduplication performance.
  • FIG. 6 is a structural diagram of Embodiment 1 of the data deduplication apparatus of the present invention.
  • This embodiment provides a deduplication apparatus that can perform the steps of Embodiment 1 of the foregoing method; details are not repeated here.
  • the deduplication apparatus provided in this embodiment may specifically include a blocking module 601, a sampling module 602, a grouping module 603, and a deduplication module 604.
  • the blocking module 601 is configured to perform block processing on the file to be stored, and calculate a fingerprint of each block in the block processing result.
  • the sampling module 602 is configured to perform sampling processing on the fingerprints of the blocks, and generate a fingerprint sampling table of the file to be stored according to the extracted fingerprints.
  • the grouping module 603 is configured to determine a similar grouping of the file to be stored in the grouping sample library based on the fingerprint sampling table and the grouping sampling database.
  • the deduplication module 604 is configured to perform deduplication on the file to be stored according to the fingerprint data in the fingerprint group corresponding to the similar group in the fingerprint database.
  • the packet sampling library is composed of at least one sample group
  • the fingerprint library is composed of at least one fingerprint group
  • each sample group in the group sample library corresponds to each fingerprint group in the fingerprint library.
  • the similar grouping is a sampling group in the packet sampling library that matches the sampling fingerprint in the fingerprint sampling table of the file to be stored.
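  • The four modules of FIG. 6 form a simple pipeline, which can be sketched as a class that wires four callables together. The class name, field names and callable signatures are hypothetical; the sketch only shows the order in which modules 601 to 604 hand data to one another.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Block = Tuple[bytes, bytes]   # (fingerprint, block data)

@dataclass
class DedupDevice:
    blocking_module: Callable[[bytes], List[Block]]        # module 601
    sampling_module: Callable[[List[bytes]], List[bytes]]  # module 602
    grouping_module: Callable[[List[bytes]], str]          # module 603
    dedup_module: Callable[[List[Block], str], int]        # module 604

    def store(self, data: bytes) -> int:
        blocks = self.blocking_module(data)
        table = self.sampling_module([fp for fp, _ in blocks])
        group_id = self.grouping_module(table)
        return self.dedup_module(blocks, group_id)
```

  • Passing the modules in as callables keeps the sketch independent of any particular blocking, sampling, or grouping strategy.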
  • FIG. 7 is a structural diagram of Embodiment 2 of the data deduplication apparatus of the present invention.
  • This embodiment provides a deduplication apparatus that can perform the steps of Embodiment 2 of the foregoing method; details are not repeated here.
  • the deduplication apparatus provided in this embodiment is based on the above-described FIG. 6, and the sampling module 602 may specifically include a determining unit 612, a sampling unit 622, and a generating unit 632.
  • the determining unit 612 is configured to determine a sampling factor according to a file feature of the file to be stored, where the file feature includes a file size and a number of blocks of the file to be stored.
  • the sampling unit 622 is configured to perform sampling processing on the fingerprints of all the blocks of the file to be stored by using the sampling factor according to the set sampling condition.
  • the generating unit 632 is configured to add the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.
  • the grouping module 603 in this embodiment may specifically include a first grouping unit 613, a matching unit 623, and a second grouping unit 633.
  • the first grouping unit 613 is configured to determine, when the fingerprint sampling table is empty, that the similar group to which the file to be stored belongs in the grouping sample library is a preset group in the grouping sample database.
  • the matching unit 623 is configured to perform matching processing on each fingerprint in the fingerprint sampling table and the group sampling database when the fingerprint sampling table is not empty.
  • the second grouping unit 633 is configured to determine, according to the matching result, a similar group to which the file to be stored belongs in the grouping sample library.
  • the second grouping unit 633 in this embodiment may specifically include a first grouping subunit 6331, a second grouping subunit 6332, and a third grouping subunit 6333.
  • the first grouping subunit 6331 is configured to determine, if the capacity of the group sampling library has reached the capacity upper limit, that the similar group to which the file to be stored belongs in the group sampling library is the group in the group sampling library with the highest similarity to the fingerprints in the fingerprint sampling table.
  • the second grouping subunit 6332 is configured to determine, if the capacity of the group sampling library has not reached the capacity upper limit and the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sample group in the group sampling library is greater than or equal to a preset similarity threshold, that the similar group to which the file to be stored belongs in the group sampling library is that sample group.
  • the third grouping subunit 6333 is configured to: if the capacity of the group sampling library has not reached the capacity upper limit and the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the group sampling library is less than the preset similarity threshold, establish a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
  • The grouping module 603 in this embodiment may be specifically configured to create a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
  • This embodiment provides a deduplication device that divides the file to be stored into blocks, calculates the fingerprint of each block, and samples the fingerprints; determines, according to the generated fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library; and deduplicates the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the block fingerprints are further sampled, the similar group is first determined by similarity analysis and deduplication is then performed only within the corresponding fingerprint group. This narrows the lookup performed during deduplication, addresses the prior-art problem that massive block data incurs heavy computation and resource consumption during deduplication, and improves deduplication performance.

Abstract

Provided is a duplicate data deletion method and device. The method includes: partitioning a file to be stored, and calculating a fingerprint of each partition in the partitioning processing result; sampling the fingerprint of each partition, and generating a fingerprint sampling table for the file to be stored according to the sampled fingerprint; determining a similar grouping of the file to be stored in a grouping sampling library according to the fingerprint sampling table and the grouping sampling library; and performing duplicate data deletion on the file to be stored according to the fingerprint data in a fingerprint grouping corresponding to the similar grouping in a fingerprint library. The device includes: a partitioning module, a sampling module, a grouping module and a duplicate data deletion module. The present invention solves the problem in the prior art that the calculation amount and the resource consumption introduced by massive partitioned data during duplicate deletion are huge and reduces the calculation amount of de-duplication during duplicate data deletion.

Description

Deduplication method and device
This application claims priority to Chinese Patent Application No. 201110380773.3, filed with the Chinese Patent Office on November 25, 2011 and entitled "Deduplication method and device", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of data storage technologies, and in particular to a deduplication method and device.
Background
Deduplication is a data reduction technique commonly used in disk-based backup systems to reduce the storage capacity consumed in a storage system. Deduplication typically targets scenarios with large data volumes. The deduplication techniques used in industry mainly include block detection, similarity detection, and Delta coding. Similarity detection and Delta coding are two alternative data compression methods, but they detect only similar files and achieve a low deduplication rate. The deduplication rate is an important measure of deduplication effectiveness and indicates the proportion of duplicate data that is removed.
FIG. 1 is a schematic diagram of the deduplication process in the prior art. As shown in FIG. 1, prior-art deduplication maintains an index of block data by building a huge block data index table in memory. During deduplication, a data object is divided into blocks, the fingerprint of each resulting block is calculated, and the fingerprints are stored in the block data index table (that is, the fingerprint library). FIG. 2 is a schematic structural diagram of the fingerprint library used in prior-art deduplication. When data is subsequently stored, the block data index table is queried first. If a block fingerprint identical to the fingerprint of the data to be stored is found, that data is not stored; only new blocks whose fingerprints are not found in the block data index table are stored. This avoids storing blocks with duplicate content, which is equivalent to deleting the duplicate blocks.
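The prior-art flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: fixed-size blocks, SHA-1 fingerprints, and a plain in-memory `dict` standing in for the block data index table. The name `store_with_global_index` is hypothetical and does not come from the source.

```python
import hashlib


def store_with_global_index(data: bytes, index: dict, block_size: int = 4096) -> int:
    """Store only blocks whose fingerprint is absent from one global index.

    Returns the number of newly stored blocks. Fixed-size blocking and
    SHA-1 are assumptions made for this sketch.
    """
    new_blocks = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha1(block).digest()
        if fingerprint not in index:    # lookup spans every stored fingerprint
            index[fingerprint] = block  # new content: keep the block
            new_blocks += 1
    return new_blocks
```

Every lookup here runs against the whole index, which is the cost the grouping scheme in this document aims to reduce.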
However, in the course of implementing the present invention, the inventors found that the prior art has at least the following drawback: in scenarios that store large volumes of data, deduplication generates a large amount of block data, and comparing it one by one against the massive number of block fingerprints in the fingerprint library requires a great deal of computation and memory, which makes deduplication inefficient.
Summary of the Invention
The present invention provides a deduplication method and device to address the prior-art problem that deduplication requires a huge amount of computation and resources, which results in low deduplication performance.
The present invention provides a deduplication method, including:
dividing a file to be stored into blocks, and calculating the fingerprint of each resulting block;
sampling the fingerprints of the blocks, and generating a fingerprint sampling table of the file to be stored from the sampled fingerprints;
determining, according to the fingerprint sampling table and a group sampling library, the similar group to which the file to be stored belongs in the group sampling library; and
deduplicating the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in a fingerprint library;
where the group sampling library consists of at least one sample group, the fingerprint library consists of at least one fingerprint group, the sample groups in the group sampling library correspond one-to-one to the fingerprint groups in the fingerprint library, and the similar group is the sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
The present invention provides a deduplication device, including:
a blocking module, configured to divide a file to be stored into blocks and calculate the fingerprint of each resulting block;
a sampling module, configured to sample the fingerprints of the blocks and generate a fingerprint sampling table of the file to be stored from the sampled fingerprints;
a grouping module, configured to determine, according to the fingerprint sampling table and a group sampling library, the similar group to which the file to be stored belongs in the group sampling library; and
a deduplication module, configured to deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in a fingerprint library;
where the group sampling library consists of at least one sample group, the fingerprint library consists of at least one fingerprint group, the sample groups in the group sampling library correspond one-to-one to the fingerprint groups in the fingerprint library, and the similar group is the sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
With the deduplication method and device provided by the present invention, the file to be stored is divided into blocks, the fingerprint of each block is calculated, and the block fingerprints are sampled. The similar group to which the file to be stored belongs in the group sampling library is determined from the generated fingerprint sampling table and the group sampling library, and the file is deduplicated according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the block fingerprints are further sampled, the similar group is first determined by similarity analysis and deduplication is then performed only within the corresponding fingerprint group. This narrows the lookup performed during deduplication, addresses the prior-art problem that massive block data incurs heavy computation and resource consumption during deduplication, and improves deduplication performance.
Brief Description of the Drawings
The accompanying drawings needed to describe the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the deduplication process in the prior art;
FIG. 2 is a schematic structural diagram of the fingerprint library used in prior-art deduplication;
FIG. 3 is a flowchart of Embodiment 1 of the deduplication method of the present invention;
FIG. 4 is a flowchart of Embodiment 2 of the deduplication method of the present invention;
FIG. 5 is a schematic structural diagram of the group sampling library in Embodiment 2 of the deduplication method of the present invention;
FIG. 6 is a structural diagram of Embodiment 1 of the deduplication device of the present invention;
FIG. 7 is a structural diagram of Embodiment 2 of the deduplication device of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 3 is a flowchart of Embodiment 1 of the deduplication method of the present invention. As shown in FIG. 3, this embodiment provides a deduplication method that may specifically include the following steps:
Step 301: Divide the file to be stored into blocks, and calculate the fingerprint of each resulting block.
In this embodiment, the same deduplication method is applied to the storage of every file; before it is stored, a file is referred to as a file to be stored. This step first divides the file to be stored into blocks. The blocking may use an existing chunking technique, for example a variable-length chunking algorithm. The fingerprint of each resulting block is then calculated, which may also use an existing method, for example a dual-hash scheme based on SHA-1 and MD5.
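Step 301 can be illustrated with the sketch below. It is not the patent's algorithm: the running hash, the boundary mask, and the size limits in `chunk_variable` are toy assumptions chosen for readability, and concatenating the SHA-1 and MD5 digests is only one plausible reading of the "dual hash" mentioned above. Both function names are hypothetical.

```python
import hashlib


def chunk_variable(data: bytes, mask: int = 0x3F,
                   min_size: int = 16, max_size: int = 256) -> list:
    """Toy variable-length chunking: cut where a running hash hits a boundary."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * 31 + byte) & 0xFFFFFFFF       # simple running hash (assumption)
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])   # close the current chunk
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])            # trailing partial chunk
    return chunks


def fingerprint(block: bytes) -> bytes:
    """SHA-1 digest followed by MD5 digest, 36 bytes in total (assumption)."""
    return hashlib.sha1(block).digest() + hashlib.md5(block).digest()
```

A production system would normally use a windowed rolling hash so that chunk boundaries survive insertions earlier in the file; the sketch omits that for brevity.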
Step 302: Sample the fingerprints of the blocks, and generate the fingerprint sampling table of the file to be stored from the sampled fingerprints.
In this embodiment, to reduce the computation needed for deduplication, the block fingerprints of the file to be stored are sampled once they have been obtained. The basic requirements are that every fingerprint in the sampling result comes from the block fingerprints of the file, and that the sampling result contains no more fingerprints than the file has block fingerprints.
Specifically, a sampling factor may be used to sample the block fingerprints. A sampling factor is a sampling feature value that is combined with each block fingerprint of the file in a logical operation; during sampling, a subset of the block fingerprints is selected according to the sampling rule and the result of the logical operation. Different sampling factors are selected for different files to be stored, and the sampling factor can represent the characteristics of the file. The block fingerprints are sampled and filtered in this way, and the fingerprint sampling table of the file to be stored is generated from the selected fingerprints. The sampling factor may be determined from the file size and number of blocks of the file to be stored, or a preset fixed sampling factor may be used. The fingerprints in the fingerprint sampling table are those retained after filtering and can represent the characteristics of the file to be stored, which reduces the amount of fingerprint data stored subsequently.
Alternatively, the block fingerprints may be sampled without a sampling factor, in one of the following ways: directly taking the fingerprints whose last byte is 0 as the sampled fingerprints; taking the fingerprints of blocks at fixed positions, for example positions that are integer multiples of 9; or sampling at a predetermined ratio, for example randomly selecting 5% of the blocks.
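The three factor-free sampling rules just listed might look like this in code. The function names are hypothetical, fingerprints are assumed to be `bytes` objects held in a list, and counting positions from 1 is an assumption, since the source does not say where counting starts.

```python
import random


def sample_last_byte_zero(fingerprints: list) -> list:
    """Keep fingerprints whose last byte is 0."""
    return [fp for fp in fingerprints if fp[-1] == 0]


def sample_fixed_positions(fingerprints: list, step: int = 9) -> list:
    """Keep fingerprints at positions that are multiples of `step` (1-based)."""
    return [fp for pos, fp in enumerate(fingerprints, start=1) if pos % step == 0]


def sample_ratio(fingerprints: list, ratio: float = 0.05, seed=None) -> list:
    """Randomly keep a fixed proportion of the fingerprints."""
    rng = random.Random(seed)
    count = int(len(fingerprints) * ratio)
    return rng.sample(fingerprints, count)
```

Note that only the first rule depends on block content alone; the positional and random rules can select different fingerprints for two files that share the same blocks, which may matter for the later similarity matching.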
Step 303: Determine, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library.
After the fingerprint sampling table of the file to be stored is obtained, this step determines, from the fingerprint sampling table and the current group sampling library kept in the storage system, the similar group to which the file to be stored belongs in the current group sampling library. The group sampling library consists of at least one sample group, and each sample group contains the sampled fingerprints of one or more stored files that are similar to one another. The similar group is the sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored. In addition to the group sampling library, this embodiment also maintains a fingerprint library consisting of at least one fingerprint group. The fingerprint groups correspond one-to-one to the sample groups in the group sampling library, and the fingerprints kept in a fingerprint group are those of the stored files after deduplication. In other words, each sample group holds the sampled fingerprints of a set of similar stored files, whereas the corresponding fingerprint group holds all fingerprints of those files after deduplication. If the file processed in this step is the first file, the group sampling library is empty at this point. Specifically, this step may match each fingerprint in the fingerprint sampling table against the current group sampling library and thereby obtain the sample group that matches the sampled fingerprints of the file to be stored, that is, the similar group to which the file belongs.
Step 304: Deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library.
After the similar group to which the file to be stored belongs in the current group sampling library has been determined, the corresponding fingerprint group in the fingerprint library is identified through the one-to-one correspondence between sample groups and fingerprint groups, and the file to be stored is deduplicated according to the fingerprint data in that fingerprint group. The deletion method may be similar to the prior art: the calculated fingerprint of each block of the file is matched against the fingerprints kept in the fingerprint group that corresponds to the similar group. If that fingerprint group already holds a fingerprint that is the same as or similar to the fingerprint of a block, the data of the block is deleted; otherwise, the data of the block is stored. This embodiment therefore narrows the set of fingerprints queried during deduplication from the whole fingerprint library to a single fingerprint group, which greatly reduces the computation needed for query matching.
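Step 304 can be sketched as a lookup restricted to one fingerprint group. The sketch assumes the group is a Python `set` of fingerprints and handles only identical fingerprints; the source also mentions "similar" fingerprints, which this simplification does not cover. The name `deduplicate_in_group` is hypothetical.

```python
def deduplicate_in_group(blocks, fingerprint_group: set) -> list:
    """Return the block data that must actually be stored.

    blocks: iterable of (fingerprint, data) pairs for the file to be stored.
    fingerprint_group: fingerprints already kept for the similar group;
    it is updated in place with the fingerprints of newly stored blocks.
    """
    stored = []
    for fp, data in blocks:
        if fp in fingerprint_group:
            continue                   # duplicate: the block data is dropped
        fingerprint_group.add(fp)      # new block: record its fingerprint
        stored.append(data)            # and keep its data
    return stored
```

The membership test touches only the fingerprints of one group, not the whole fingerprint library.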
This embodiment provides a deduplication method in which the file to be stored is divided into blocks, the fingerprint of each block is calculated, and the block fingerprints are sampled. The similar group to which the file to be stored belongs in the group sampling library is determined from the generated fingerprint sampling table and the group sampling library, and the file is deduplicated according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the block fingerprints are further sampled, the similar group is first determined by similarity analysis and deduplication is then performed only within the corresponding fingerprint group. This narrows the lookup performed during deduplication, addresses the prior-art problem that massive block data incurs heavy computation and resource consumption during deduplication, and improves deduplication performance.
FIG. 4 is a flowchart of Embodiment 2 of the deduplication method of the present invention. As shown in FIG. 4, this embodiment provides a deduplication method that may specifically include the following steps:
Step 401: Divide the file to be stored into blocks, and calculate the fingerprint of each resulting block. This step may be similar to step 301 above and is not described again here.
Step 402: Determine the sampling factor according to the file characteristics of the file to be stored.
After the file to be stored has been divided into blocks and the block fingerprints have been calculated, this embodiment samples the block fingerprints; this step determines the sampling factor used for that sampling. Specifically, the sampling factor is determined from the file characteristics of the file to be stored, such as its file size or number of blocks, so different files to be stored may yield different sampling factors. For example, when the number of blocks of the file to be stored is greater than 1 million, the sampling factor is determined to be 0xFFF; when it is less than 1 million and greater than 100,000, the sampling factor is 0x3FF; when it is less than 100,000 and greater than 10,000, the sampling factor is 0x2FF; and when it is less than 10,000, the sampling factor is 0x3F.
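The example thresholds in step 402 translate directly into a small lookup. The function name `choose_sampling_factor` is hypothetical. The source does not say which factor applies at exactly 1,000,000, 100,000, or 10,000 blocks, so assigning those boundary values to the lower tier is an assumption of this sketch.

```python
def choose_sampling_factor(block_count: int) -> int:
    """Pick a sampling factor from the number of blocks in the file."""
    if block_count > 1_000_000:
        return 0xFFF
    if block_count > 100_000:   # boundary handling is an assumption
        return 0x3FF
    if block_count > 10_000:
        return 0x2FF
    return 0x3F
```

Larger files get a factor with more bits set, so a smaller share of their fingerprints passes the sampling condition.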
Step 403: Sample the fingerprints of all blocks of the file to be stored using the sampling factor, according to the set sampling condition.
When the block fingerprints are sampled, the sampling factor may be applied according to the set sampling condition. For example, each block fingerprint may be ANDed with the sampling factor; if the result is 0, the fingerprint meets the set sampling condition.
Step 404: Add the fingerprints of the blocks in the sampling result to the fingerprint sampling table of the file to be stored.
The sampling process above yields a sampling result for each block, and the fingerprints of the blocks whose results meet the sampling condition are added to the fingerprint sampling table of the file to be stored. For example, the fingerprints whose AND with the sampling factor is 0 may be added to the fingerprint sampling table, while the remaining fingerprints may stay in the database where the fingerprints were originally kept. This forms the fingerprint sampling table of the file to be stored.
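Steps 403 and 404 can be sketched as a mask test over the fingerprints. The sketch assumes each fingerprint is a `bytes` value interpreted as a big-endian integer and that the AND is bitwise; the source only says the fingerprint and the sampling factor are ANDed. The name `build_sampling_table` is hypothetical.

```python
def build_sampling_table(fingerprints, sampling_factor: int) -> list:
    """Keep the fingerprints whose AND with the sampling factor is 0."""
    table = []
    for fp in fingerprints:
        value = int.from_bytes(fp, "big")      # integer view of the fingerprint (assumption)
        if (value & sampling_factor) == 0:     # sampling condition from step 403
            table.append(fp)
    return table
```

For fingerprints that behave like uniformly distributed hash values, a factor with k bits set keeps about one fingerprint in 2^k, so 0xFFF keeps roughly 1 in 4096 and 0x3F roughly 1 in 64.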
Step 405: Determine whether the fingerprint sampling table of the file to be stored is empty. If yes, perform step 406; otherwise, perform step 407.
After the fingerprint sampling table of the file to be stored is obtained, the file is matched and stored by group according to the fingerprint sampling table and the current group sampling library. This step first determines whether the fingerprint sampling table is empty, that is, whether the sampling process produced any fingerprint that meets the sampling condition. If the table is empty, step 406 is performed; otherwise, step 407 is performed.
Step 406: Determine that the similar group to which the file to be stored belongs in the group sampling library is a preset group in the group sampling library, and perform step 411.
An empty fingerprint sampling table indicates that none of the sampling results meets the sampling condition, that is, the file to be stored contains no block that meets the sampling condition. In this case, the similar group to which the file belongs in the current group sampling library is determined to be the preset group in that library. The similarity analysis of this embodiment then ends, and step 411 is performed to deduplicate the file within the fingerprint group that corresponds to the preset group in the fingerprint library. The preset group is a group set in advance in this embodiment and has no particular meaning. It may be empty, and it corresponds to one specific fingerprint group in the fingerprint library, which holds the fingerprints of the files whose fingerprint sampling tables were empty after sampling. In practice, an empty fingerprint sampling table is a special case; it is described here only so that the special case does not interrupt the overall flow.
Step 407: Match each fingerprint in the fingerprint sampling table against the group sampling library.
A non-empty fingerprint sampling table indicates that the sampling process produced results that meet the sampling condition. The fingerprints kept in the fingerprint sampling table are then matched against the current group sampling library. Fingerprints in the group sampling library are kept in groups, and the fingerprints in each group are the sampled fingerprints of one or more files that have a certain similarity. In this step, the fingerprints in the fingerprint sampling table are compared, group by group, with the fingerprints in each group of the current group sampling library, which yields a matching result for each group. The matching result is the similarity between the fingerprints in the fingerprint sampling table and the fingerprints in the group. For example, the similarity may be the proportion of fingerprints in the fingerprint sampling table that are the same as or similar to fingerprints in the group.
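The example similarity measure in step 407 can be written as a simple ratio. The sketch counts only identical fingerprints, which is a simplification of the "same or similar" wording above, and it assumes each sample group is a `set` of fingerprints. The name `group_similarity` is hypothetical.

```python
def group_similarity(sampling_table: list, sample_group: set) -> float:
    """Share of the file's sampled fingerprints that also appear in the group."""
    if not sampling_table:
        return 0.0                      # empty tables are handled separately in step 406
    matches = sum(1 for fp in sampling_table if fp in sample_group)
    return matches / len(sampling_table)
```

For instance, a table of four fingerprints of which two appear in the group gives a similarity of 0.5.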
Step 408: Determine whether the capacity of the group sampling library has reached its upper limit. If so, perform step 409; otherwise, perform step 410.
This step determines whether the capacity of the current group sampling library has reached its upper limit, that is, whether the group sampling library is full. FIG. 5 is a schematic structural diagram of the group sampling library in Embodiment 2 of the deduplication method of the present invention. If the library is full, step 409 is performed; otherwise, step 410 is performed.
Step 409: Determine that the similar group to which the file to be stored belongs in the group sampling library is the group in the group sampling library that has the highest similarity to the fingerprints in the fingerprint sampling table.
In this embodiment, if the capacity of the current group sampling library has reached its upper limit, the group with the highest similarity to the fingerprints in the fingerprint sampling table is obtained from the matching results, that group is determined to be the similar group to which the file to be stored belongs in the current group sampling library, and step 411 is performed.
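When the library is full, step 409 reduces to picking the group with the largest matching result. A minimal sketch, assuming the matching results are held in a dictionary from a hypothetical group id to a similarity value; the function name is illustrative, and how ties are broken is not specified in the text (here the first maximal entry in iteration order wins):

```python
def most_similar_group(match_results):
    """Step 409 sketch: return the id of the group with the highest similarity.

    Assumption: `match_results` is a non-empty dict {group_id: similarity}.
    """
    return max(match_results, key=match_results.get)
```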
Step 410: Determine the similar group to which the file to be stored belongs in the group sampling library according to the fingerprint similarity between the fingerprints in the fingerprint sampling table and each group in the group sampling library.
The fingerprints in the fingerprint sampling table are matched against the groups in the current group sampling library one group at a time, and the similar group to which the file to be stored belongs is determined from the resulting fingerprint similarities. Specifically, when the fingerprint similarity between the fingerprint sampling table and a sampling group in the current group sampling library is greater than or equal to a preset similarity threshold, the file to be stored is considered to belong to that sampling group; that sampling group is directly determined to be the similar group, and step 411 is performed. In this embodiment, the first sampling group that satisfies the similarity condition is taken as the similar group selected by the similarity analysis, and the remaining sampling groups are not matched. This reduces the amount of computation in the similarity analysis and improves its performance.
When the fingerprint similarity between the fingerprint sampling table and every group in the current group sampling library is less than the preset similarity threshold, the file to be stored does not belong to any group in the current group sampling library. A new group is then created in the current group sampling library, the new group is determined to be the similar group to which the file to be stored belongs, the fingerprints in the fingerprint sampling table of the file to be stored are saved to the new group, and step 411 is performed.
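The first-match rule of step 410 can be sketched as follows. The sketch assumes exact fingerprint equality, a library held as a dictionary of sets, and a hypothetical naming scheme (`group-N`) for new groups; none of these details come from the patent text. Note that an empty library falls straight through to the new-group branch.

```python
def assign_group_not_full(sample_table, library, threshold):
    """Step 410 sketch: the first group reaching the threshold is selected;
    if no group reaches it, a new group is created from the sampled fingerprints.

    Assumptions: `library` maps group id -> set of sampled fingerprints,
    and `sample_table` is non-empty.
    """
    if not sample_table:
        raise ValueError("empty fingerprint sampling table")
    for gid, fps in library.items():
        hits = sum(1 for fp in sample_table if fp in fps)
        if hits / len(sample_table) >= threshold:
            return gid  # early exit: later groups are never compared
    new_gid = f"group-{len(library)}"  # hypothetical id scheme
    library[new_gid] = set(sample_table)
    return new_gid
```

The early `return` is what saves work: groups after the first qualifying one are not examined, at the cost of possibly missing a later group with a higher similarity.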
Further, in this embodiment, when the file to be stored is the first file to be stored, the similar group to which it belongs in the current group sampling library is determined from the fingerprint sampling table and the current group sampling library as follows: a new group is created in the current group sampling library, the new group is determined to be the similar group to which the file to be stored belongs, and the fingerprints in the fingerprint sampling table of the file to be stored are saved to the new group.
Step 411: Deduplicate the file to be stored according to the fingerprint data in the fingerprint group of the fingerprint library that corresponds to the similar group.
After the similar group to which the file to be stored belongs in the current group sampling library has been determined through the above steps, the file to be stored is deduplicated according to the fingerprint data in the fingerprint group of the fingerprint library that corresponds to the similar group. The specific deletion method may be similar to the prior art: the computed fingerprint of each block of the file to be stored is matched against the fingerprints saved in the fingerprint group corresponding to the similar group. If that fingerprint group already contains a fingerprint identical or similar to the fingerprint of a block, the data of that block is deleted; if it does not, the data of that block is stored. This embodiment therefore narrows the fingerprint range queried during deduplication from the entire fingerprint library to a single fingerprint group, which greatly reduces the computation required for query matching.
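Step 411 can be illustrated with the following sketch, which looks up block fingerprints only in the one selected fingerprint group. Several details are assumptions made for the example and are not stated in this section: SHA-1 as the fingerprint function, exact-match lookup, adding newly stored fingerprints to the group, and returning an ordered fingerprint list (a "recipe") so the file could later be reassembled.

```python
import hashlib


def deduplicate(blocks, fingerprint_group, block_store):
    """Store only the blocks whose fingerprint is absent from the selected group.

    blocks            -- iterable of bytes objects (the file's blocks, in order)
    fingerprint_group -- set of fingerprints for the similar group (mutated)
    block_store       -- dict fingerprint -> block data (mutated)
    Returns the ordered list of block fingerprints for the file.
    """
    recipe = []
    for block in blocks:
        fp = hashlib.sha1(block).hexdigest()  # assumed fingerprint function
        if fp not in fingerprint_group:
            # New block: keep its data and remember its fingerprint.
            fingerprint_group.add(fp)
            block_store[fp] = block
        # Duplicate blocks are dropped; only the fingerprint is recorded.
        recipe.append(fp)
    return recipe
```

Because the lookup touches a single group rather than the whole fingerprint library, a duplicate block whose fingerprint lives in a different group would not be detected; that is the trade-off this scheme accepts for the smaller search range.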
This embodiment provides a deduplication method. The file to be stored is divided into blocks and the fingerprint of each block is computed; the fingerprints are sampled using a sampling factor; the similar group to which the file to be stored belongs in the current group sampling library is determined from the generated fingerprint sampling table and the current group sampling library; and the file to be stored is deduplicated according to the fingerprint data in the fingerprint group of the fingerprint library that corresponds to the similar group. Because the block fingerprints are further sampled, a similar group is first determined through similarity analysis, and deduplication is then performed only within the fingerprint group corresponding to the similar group, the amount of query computation for deduplication is reduced. This addresses the prior-art problem that the massive number of blocks handled during deduplication incurs a huge computation and resource cost, and it improves deduplication performance.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
FIG. 6 is a structural diagram of Embodiment 1 of the deduplication apparatus of the present invention. As shown in FIG. 6, this embodiment provides a deduplication apparatus that can perform the steps of method Embodiment 1 above, which are not repeated here. The deduplication apparatus provided in this embodiment may include a blocking module 601, a sampling module 602, a grouping module 603, and a deduplication module 604. The blocking module 601 is configured to perform block processing on the file to be stored and compute the fingerprint of each block in the block processing result. The sampling module 602 is configured to sample the fingerprints of the blocks and generate the fingerprint sampling table of the file to be stored from the extracted fingerprints. The grouping module 603 is configured to determine, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library. The deduplication module 604 is configured to deduplicate the file to be stored according to the fingerprint data in the fingerprint group of the fingerprint library that corresponds to the similar group. The group sampling library consists of at least one sampling group, the fingerprint library consists of at least one fingerprint group, and the sampling groups in the group sampling library correspond one-to-one with the fingerprint groups in the fingerprint library. The similar group is a sampling group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
FIG. 7 is a structural diagram of Embodiment 2 of the deduplication apparatus of the present invention. As shown in FIG. 7, this embodiment provides a deduplication apparatus that can perform the steps of method Embodiment 2 above, which are not repeated here. Building on the apparatus shown in FIG. 6, the sampling module 602 may include a determining unit 612, a sampling unit 622, and a generating unit 632. The determining unit 612 is configured to determine a sampling factor according to the file features of the file to be stored, where the file features include the file size and the number of blocks of the file to be stored. The sampling unit 622 is configured to sample the fingerprints of all blocks of the file to be stored using the sampling factor, according to a set sampling condition. The generating unit 632 is configured to add the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.
Specifically, the grouping module 603 in this embodiment may include a first grouping unit 613, a matching unit 623, and a second grouping unit 633. The first grouping unit 613 is configured to determine, when the fingerprint sampling table is empty, that the similar group to which the file to be stored belongs in the group sampling library is the preset group in the group sampling library. The matching unit 623 is configured to match each fingerprint in the fingerprint sampling table against the group sampling library when the fingerprint sampling table is not empty. The second grouping unit 633 is configured to determine, according to the matching result, the similar group to which the file to be stored belongs in the group sampling library.
Further, the second grouping unit 633 in this embodiment may include a first grouping subunit 6331, a second grouping subunit 6332, and a third grouping subunit 6333. The first grouping subunit 6331 is configured to determine, if the capacity of the group sampling library has reached its upper limit, that the similar group to which the file to be stored belongs in the group sampling library is the group in the group sampling library that has the highest similarity to the fingerprints in the fingerprint sampling table. The second grouping subunit 6332 is configured to determine, if the capacity of the group sampling library has not reached its upper limit and the fingerprint similarity between the fingerprint sampling table and a sampling group in the group sampling library is greater than or equal to the preset similarity threshold, that the similar group to which the file to be stored belongs in the group sampling library is that sampling group. The third grouping subunit 6333 is configured to, if the capacity of the group sampling library has not reached its upper limit and the fingerprint similarity between the fingerprint sampling table and every group in the group sampling library is less than the preset similarity threshold, create a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
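The division of labor among units 613, 623, and 633 (with subunits 6331 to 6333) can be summarized as one decision function. This is a hedged sketch rather than the apparatus itself: the function name, the `"preset"` identifier, the `group-N` naming scheme, and the reading of "capacity" as a maximum number of groups (`max_groups`, assumed to be at least 1) are all assumptions made for illustration. The preset group is also kept outside the `library` dictionary here for simplicity.

```python
PRESET_GROUP = "preset"  # hypothetical identifier for the preset group


def select_similar_group(sample_table, library, threshold, max_groups):
    """Decide which sampling group a file belongs to.

    library    -- dict group id -> set of sampled fingerprints (may be mutated)
    max_groups -- assumed meaning of the capacity upper limit (>= 1)
    """
    # First grouping unit 613: empty sampling table -> preset group.
    if not sample_table:
        return PRESET_GROUP
    # Matching unit 623: similarity to every group (computed eagerly for
    # brevity; the first-match optimization would stop at the first hit).
    results = {
        gid: sum(1 for fp in sample_table if fp in fps) / len(sample_table)
        for gid, fps in library.items()
    }
    # Subunit 6331: library full -> most similar group.
    if len(library) >= max_groups:
        return max(results, key=results.get)
    # Subunit 6332: first group at or above the threshold.
    for gid, similarity in results.items():
        if similarity >= threshold:
            return gid
    # Subunit 6333: no group qualifies -> create a new group.
    new_gid = f"group-{len(library)}"
    library[new_gid] = set(sample_table)
    return new_gid
```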
Specifically, when the file to be stored is the first file to be stored, the grouping module 603 in this embodiment may further be configured to create a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
This embodiment provides a deduplication apparatus. The file to be stored is divided into blocks, the fingerprint of each block is computed, and the fingerprints are sampled; the similar group to which the file to be stored belongs in the group sampling library is determined from the generated fingerprint sampling table and the group sampling library; and the file to be stored is deduplicated according to the fingerprint data in the fingerprint group of the fingerprint library that corresponds to the similar group. Because the block fingerprints are further sampled, a similar group is first determined through similarity analysis, and deduplication is then performed only within the fingerprint group corresponding to the similar group, the amount of query computation for deduplication is reduced. This addresses the prior-art problem that the massive number of blocks handled during deduplication incurs a huge computation and resource cost, and it improves deduplication performance.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A deduplication method, characterized by comprising:
performing block processing on a file to be stored, and computing the fingerprint of each block in the block processing result;
sampling the fingerprints, and generating a fingerprint sampling table of the file to be stored from the extracted fingerprints;
determining, according to the fingerprint sampling table and a group sampling library, the similar group to which the file to be stored belongs in the group sampling library; and
deduplicating the file to be stored according to the fingerprint data in the fingerprint group of a fingerprint library that corresponds to the similar group;
wherein the group sampling library consists of at least one sampling group, the fingerprint library consists of at least one fingerprint group, each fingerprint group of the fingerprint library stores all fingerprints of stored files after deduplication, the sampling groups in the group sampling library correspond one-to-one with the fingerprint groups in the fingerprint library, each sampling group includes the sampled fingerprints of one or more stored files that have similarity, and the similar group is a sampling group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
2. The method according to claim 1, characterized in that sampling the fingerprints of the blocks and generating the fingerprint sampling table of the file to be stored from the extracted fingerprints comprises:
determining a sampling factor according to file features of the file to be stored, the file features including the file size and the number of blocks of the file to be stored;
sampling the fingerprints of all blocks of the file to be stored using the sampling factor, according to a set sampling condition; and
adding the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.
3. The method according to claim 1 or 2, characterized in that determining, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library comprises:
when the fingerprint sampling table is empty, determining that the similar group to which the file to be stored belongs in the group sampling library is a preset group in the group sampling library; and
when the fingerprint sampling table is not empty, matching each fingerprint in the fingerprint sampling table against the group sampling library, and determining, according to the matching result, the similar group to which the file to be stored belongs in the group sampling library.
4. The method according to claim 3, characterized in that determining, according to the matching result, the similar group to which the file to be stored belongs in the group sampling library comprises:
if the capacity of the group sampling library has reached its upper limit, determining that the similar group to which the file to be stored belongs in the group sampling library is the group in the group sampling library that has the highest similarity to the fingerprints in the fingerprint sampling table;
if the capacity of the group sampling library has not reached its upper limit, and the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sampling group in the group sampling library is greater than or equal to a preset similarity threshold, determining that the similar group to which the file to be stored belongs in the group sampling library is that sampling group; and
if the capacity of the group sampling library has not reached its upper limit, and the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the group sampling library is less than the preset similarity threshold, creating a new group in the group sampling library, determining that the similar group to which the file to be stored belongs in the group sampling library is the new group, and saving the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
5. The method according to claim 1, characterized in that, when the file to be stored is the first file to be stored, determining, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library is specifically:
creating a new group in the group sampling library, determining that the similar group to which the file to be stored belongs in the group sampling library is the new group, and saving the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
6. A deduplication apparatus, characterized by comprising:
a blocking module, configured to perform block processing on a file to be stored and compute the fingerprint of each block in the block processing result;
a sampling module, configured to sample the fingerprints and generate a fingerprint sampling table of the file to be stored from the extracted fingerprints;
a grouping module, configured to determine, according to the fingerprint sampling table and a group sampling library, the similar group to which the file to be stored belongs in the group sampling library; and
a deduplication module, configured to deduplicate the file to be stored according to the fingerprint data in the fingerprint group of a fingerprint library that corresponds to the similar group;
wherein the group sampling library consists of at least one sampling group, the fingerprint library consists of at least one fingerprint group, each fingerprint group of the fingerprint library stores all fingerprints of stored files after deduplication, the sampling groups in the group sampling library correspond one-to-one with the fingerprint groups in the fingerprint library, each sampling group includes the sampled fingerprints of one or more stored files that have similarity, and the similar group is a sampling group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.
7. The apparatus according to claim 6, characterized in that the sampling module comprises:
a determining unit, configured to determine a sampling factor according to file features of the file to be stored, the file features including the file size and the number of blocks of the file to be stored;
a sampling unit, configured to sample the fingerprints of all blocks of the file to be stored using the sampling factor, according to a set sampling condition; and
a generating unit, configured to add the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.
8. The apparatus according to claim 6 or 7, characterized in that the grouping module comprises:
a first grouping unit, configured to determine, when the fingerprint sampling table is empty, that the similar group to which the file to be stored belongs in the group sampling library is a preset group in the group sampling library;
a matching unit, configured to match each fingerprint in the fingerprint sampling table against the group sampling library when the fingerprint sampling table is not empty; and
a second grouping unit, configured to determine, according to the matching result of the matching unit, the similar group to which the file to be stored belongs in the group sampling library.
9. The apparatus according to claim 8, characterized in that the second grouping unit comprises:
a first grouping subunit, configured to determine, if the capacity of the group sampling library has reached its upper limit, that the similar group to which the file to be stored belongs in the group sampling library is the group in the group sampling library that has the highest similarity to the fingerprints in the fingerprint sampling table;
a second grouping subunit, configured to determine, if the capacity of the group sampling library has not reached its upper limit and the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sampling group in the group sampling library is greater than or equal to a preset similarity threshold, that the similar group to which the file to be stored belongs in the group sampling library is that sampling group; and
a third grouping subunit, configured to, if the capacity of the group sampling library has not reached its upper limit and the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the group sampling library is less than the preset similarity threshold, create a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
10. The apparatus according to claim 6, characterized in that, when the file to be stored is the first file to be stored, the grouping module is specifically configured to create a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
PCT/CN2012/085278 2011-11-25 2012-11-26 Duplicate data deletion method and device WO2013075668A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110380773.3 2011-11-25
CN201110380773.3A CN103150260B (en) 2011-11-25 2011-11-25 Data de-duplication method and device

Publications (1)

Publication Number Publication Date
WO2013075668A1 true WO2013075668A1 (en) 2013-05-30

Family

ID=48469137

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/085278 WO2013075668A1 (en) 2011-11-25 2012-11-26 Duplicate data deletion method and device

Country Status (2)

Country Link
CN (1) CN103150260B (en)
WO (1) WO2013075668A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941605A (en) * 2019-11-07 2020-03-31 北京浪潮数据技术有限公司 Method and device for deleting repeated data on line and readable storage medium

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015061995A1 (en) * 2013-10-30 2015-05-07 华为技术有限公司 Data processing method, device, and duplication processor
CN103631933B (en) * 2013-12-06 2017-04-12 中国科学院计算技术研究所 Distributed duplication elimination system-oriented data routing method
WO2015100639A1 (en) * 2013-12-31 2015-07-09 华为技术有限公司 De-duplication method, apparatus and system
CN103995863B (en) * 2014-05-19 2018-06-19 华为技术有限公司 A kind of method and device of data de-duplication
WO2017214793A1 (en) * 2016-06-13 2017-12-21 北京小米移动软件有限公司 Fingerprint template generation method and apparatus
CN106409317B (en) * 2016-09-29 2020-02-07 北京小米移动软件有限公司 Method and device for extracting dream speech
CN107451204B (en) * 2017-07-10 2021-01-05 创新先进技术有限公司 Data query method, device and equipment
CN108280628A (en) * 2018-02-01 2018-07-13 泰康保险集团股份有限公司 Core based on block chain technology pays for method, apparatus, medium and electronic equipment
CN111488269B (en) * 2019-01-29 2023-11-14 阿里巴巴集团控股有限公司 Index detection method, device and system for data warehouse
CN116991329B (en) * 2023-09-25 2023-12-08 深圳市明泰智能技术有限公司 Data redundancy prevention method and system for self-service terminal equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008062145A1 (en) * 2006-11-22 2008-05-29 Half Minute Media Limited Creating fingerprints
CN101374234A (en) * 2008-09-25 2009-02-25 清华大学 Method and apparatus for monitoring video copy base on content
CN102222085A (en) * 2011-05-17 2011-10-19 华中科技大学 Data de-duplication method based on combination of similarity and locality

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214210B (en) * 2011-05-16 2013-03-13 华为数字技术(成都)有限公司 Method, device and system for processing repeating data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941605A (en) * 2019-11-07 2020-03-31 北京浪潮数据技术有限公司 Method and device for deleting repeated data on line and readable storage medium
CN110941605B (en) * 2019-11-07 2022-07-08 北京浪潮数据技术有限公司 Method and device for deleting repeated data on line and readable storage medium

Also Published As

Publication number Publication date
CN103150260A (en) 2013-06-12
CN103150260B (en) 2016-06-08

Similar Documents

Publication Publication Date Title
WO2013075668A1 (en) Duplicate data deletion method and device
EP2940598B1 (en) Data object processing method and device
US9223794B2 (en) Method and apparatus for content-aware and adaptive deduplication
WO2013086969A1 (en) Method, device and system for finding duplicate data
WO2014094479A1 (en) Method and device for deleting duplicate data
Wang et al. Research on a clustering data de-duplication mechanism based on Bloom Filter
Bo et al. Research on chunking algorithms of data de-duplication
WO2017096532A1 (en) Data storage method and apparatus
WO2012065408A1 (en) Disaster tolerance data backup method and system
WO2014037767A1 (en) Multi-level inline data deduplication
WO2014067063A1 (en) Duplicate data retrieval method and device
US10509771B2 (en) System and method for data storage, transfer, synchronization, and security using recursive encoding
Bhalerao et al. A survey: On data deduplication for efficiently utilizing cloud storage for big data backups
CN108415671B (en) Method and system for deleting repeated data facing green cloud computing
WO2014000458A1 (en) Small file processing method and device
Xu et al. A lightweight virtual machine image deduplication backup approach in cloud environment
CN103152430A (en) Cloud storage method for reducing data-occupied space
WO2021082926A1 (en) Data compression method and apparatus
Zhou et al. Hysteresis re-chunking based metadata harnessing deduplication of disk images
Kim et al. Design and implementation of binary file similarity evaluation system
CN112162973A (en) Fingerprint collision avoidance, deduplication and recovery method, storage medium and deduplication system
JP6113816B1 (en) Information processing system, information processing apparatus, and program
US10162832B1 (en) Data aware deduplication
CN104281412A (en) Method for removing repeating data before data storage
WO2018036290A1 (en) Data compression method and terminal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12851328

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12851328

Country of ref document: EP

Kind code of ref document: A1