WO2013075668A1

WO2013075668A1 - Duplicate data deletion method and device

Info

Publication number: WO2013075668A1
Application number: PCT/CN2012/085278
Authority: WO
Inventors: 付旭东; 徐君
Original assignee: 华为技术有限公司
Priority date: 2011-11-25
Filing date: 2012-11-26
Publication date: 2013-05-30
Also published as: CN103150260A; CN103150260B

Abstract

Provided is a duplicate data deletion method and device. The method includes: partitioning a file to be stored, and calculating a fingerprint of each partition in the partitioning processing result; sampling the fingerprint of each partition, and generating a fingerprint sampling table for the file to be stored according to the sampled fingerprint; determining a similar grouping of the file to be stored in a grouping sampling library according to the fingerprint sampling table and the grouping sampling library; and performing duplicate data deletion on the file to be stored according to the fingerprint data in a fingerprint grouping corresponding to the similar grouping in a fingerprint library. The device includes: a partitioning module, a sampling module, a grouping module and a duplicate data deletion module. The present invention solves the problem in the prior art that the calculation amount and the resource consumption introduced by massive partitioned data during duplicate deletion are huge and reduces the calculation amount of de-duplication during duplicate data deletion.

Description

Deduplication method and device

This application claims priority to a Chinese patent application filed with the China Patent Office on November 25, 2011, the entire disclosure of which is hereby incorporated by reference.

The present invention relates to the field of data storage technologies, and in particular, to a deduplication method and apparatus.

BACKGROUND OF THE INVENTION

Duplicate data deletion (deduplication for short) is a data reduction technique commonly used in disk-based backup systems to reduce the storage capacity used in storage systems. Deduplication generally targets scenarios with large data volumes. The deduplication techniques used in the industry mainly include block detection, similarity detection, and delta coding. Of these, similarity detection and delta coding are two other data compression methods, but they only detect similar files, and their deduplication rate is low. The deduplication rate is an important indicator of the effectiveness of deduplication; it identifies the proportion of duplicate data that is removed.

FIG. 1 is a schematic diagram of the deduplication process in the prior art. As shown in FIG. 1, prior-art deduplication builds a huge block data index table in memory to maintain the index of the block data. During deduplication, the data object is divided into blocks, the fingerprint of each block is calculated, and the fingerprints are stored in the block data index table (that is, the fingerprint library). FIG. 2 is a schematic structural diagram of the fingerprint library in prior-art deduplication. When data is to be stored, the block data index table is queried first. If a fingerprint identical to that of the data to be stored is found, the data is not stored; only new blocks whose fingerprints are not found in the block data index table are stored. This avoids storing blocks with duplicate content, which is equivalent to deleting the duplicate blocks.

However, in the process of implementing the present invention, the inventors found that the prior art has at least the following drawback: when large amounts of data are stored, deduplication generates a large number of data blocks, and comparing their fingerprints one by one against the large number of block fingerprints in the fingerprint library requires a great deal of computation and memory, which makes the deduplication process inefficient.

Summary of the invention

The present invention provides a deduplication method and apparatus, which solve the problem in the prior art that deduplication requires a large amount of computation and resources, resulting in low deduplication performance.

The present invention provides a method for deduplication, including:

performing block processing on a file to be stored, and calculating a fingerprint of each block in the block processing result;

sampling the fingerprints of the blocks, and generating a fingerprint sampling table of the file to be stored according to the extracted fingerprints;

determining, according to the fingerprint sampling table and a group sampling library, a similar group to which the file to be stored belongs in the group sampling library; and

performing deduplication on the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in a fingerprint library;

wherein the group sampling library is composed of at least one sample group, the fingerprint library is composed of at least one fingerprint group, and each sample group in the group sampling library corresponds to a fingerprint group in the fingerprint library. The similar group is a sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.

The present invention provides a data deduplication apparatus, including:

a blocking module, configured to perform block processing on the file to be stored, and calculate a fingerprint of each block in the block processing result;

a sampling module, configured to perform sampling processing on the fingerprints of the blocks, and generate a fingerprint sampling table of the file to be stored according to the extracted fingerprints;

a grouping module, configured to determine, according to the fingerprint sampling table and the group sampling database, a similar group to which the file to be stored belongs in the grouping sample library;

a deduplication module, configured to perform deduplication on the file to be stored according to the fingerprint data in the fingerprint group corresponding to the similar grouping in the fingerprint database;

wherein the group sampling library is composed of at least one sample group, the fingerprint library is composed of at least one fingerprint group, and each sample group in the group sampling library corresponds to a fingerprint group in the fingerprint library. The similar group is a sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.

The deduplication method and apparatus provided by the present invention perform block processing on a file to be stored, calculate a fingerprint of each block, sample the fingerprints of the blocks, determine, according to the generated fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the block fingerprints are additionally sampled, the similar group is first determined by similarity analysis, and deduplication is then performed only within the fingerprint group corresponding to the similar group, the amount of query computation is reduced. This solves the problem in the prior art that massive block data introduces a huge amount of computation and resource consumption during deduplication, reduces the deduplication computation, and improves deduplication performance.

The drawings used in the description of the embodiments or of the prior art are briefly described below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of the deduplication process in the prior art;

FIG. 2 is a schematic structural diagram of the fingerprint library in prior-art deduplication;

FIG. 3 is a flowchart of Embodiment 1 of the deduplication method of the present invention;

FIG. 4 is a flowchart of Embodiment 2 of the deduplication method of the present invention;

FIG. 5 is a schematic structural diagram of the group sampling library in Embodiment 2 of the deduplication method of the present invention;

FIG. 6 is a structural diagram of Embodiment 1 of the deduplication apparatus of the present invention;

FIG. 7 is a structural diagram of Embodiment 2 of the deduplication apparatus of the present invention.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

FIG. 3 is a flowchart of Embodiment 1 of the data deduplication method of the present invention. As shown in FIG. 3, the embodiment provides a method for deleting data, which may specifically include the following steps:

Step 301: Perform block processing on the file to be stored, and calculate the fingerprint of each block in the block processing result.

In this embodiment, the same deduplication method is applied to the storage of every file; before it is stored, a file is referred to as the file to be stored. In this step, the file to be stored is divided into blocks. The blocking may use a prior-art technique, for example a variable-length blocking algorithm. The fingerprint of each resulting block is then calculated. The fingerprint calculation may also use a prior-art method; for example, a SHA-1 and MD5 double hash may be used to calculate the fingerprint of each block.
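Step 301 can be illustrated with a minimal sketch. The function names `chunk_fixed` and `fingerprint` are hypothetical, fixed-size blocking is used here only for brevity in place of the variable-length blocking mentioned above, and concatenating the two digests is one assumed way of combining SHA-1 and MD5.

```python
import hashlib


def chunk_fixed(data: bytes, size: int = 4096) -> list[bytes]:
    """Split data into fixed-size blocks (a simplification of variable-length blocking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(block: bytes) -> bytes:
    """Double-hash fingerprint: SHA-1 digest (20 bytes) followed by MD5 digest (16 bytes)."""
    return hashlib.sha1(block).digest() + hashlib.md5(block).digest()


blocks = chunk_fixed(b"example file contents " * 1000)
fingerprints = [fingerprint(b) for b in blocks]
```

Identical blocks always yield identical fingerprints, which is the property the later matching steps rely on.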

Step 302: Perform sampling processing on the fingerprints of the blocks, and generate a fingerprint sampling table of the file to be stored according to the extracted fingerprints.

In this embodiment, to reduce the amount of computation during deduplication, the fingerprints are sampled after the fingerprints of the blocks of the file to be stored have been obtained. The basic requirements of the sampling are that every fingerprint in the sampling result is one of the block fingerprints of the file to be stored, and that the number of fingerprints in the sampling result does not exceed the number of block fingerprints of the file to be stored.

Specifically, a sampling factor can be used to sample the block fingerprints. The sampling factor is a sampling feature value used in a logical operation with each block fingerprint of the file; sampling selects a subset of the fingerprints according to the sampling rule and the result of the logical operation. Different sampling factors may be selected for different files to be stored, and the sampling factor can reflect the characteristics of the file to be stored. The fingerprints of the blocks are sampled, and the fingerprint sampling table of the file to be stored is generated from the extracted fingerprints. The sampling factor may be determined according to the file size and the number of blocks of the file to be stored, or a predetermined fixed sampling factor may be used. The fingerprints in the fingerprint sampling table are those retained by the selection and can represent the characteristics of the file to be stored, so that the amount of fingerprint data stored subsequently can be reduced.

Alternatively, the sampling of the block fingerprints in this step may be independent of a sampling factor. Specifically, the extracted fingerprints may be selected directly on the basis of the last byte of each block fingerprint; or the fingerprints of blocks at fixed positions may be used as the extracted fingerprints, for example the blocks at positions that are integer multiples of 9; or fingerprints may be extracted at a predetermined sampling ratio, for example by randomly selecting 5% of the blocks and using their fingerprints as the extracted fingerprints.
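Two of the factor-independent sampling alternatives can be sketched as follows. The function names are hypothetical, and counting block positions from 1 is an assumption, since the text does not say whether positions start at 0 or 1.

```python
import random


def sample_by_position(fingerprints: list, step: int = 9) -> list:
    """Keep fingerprints of blocks whose 1-based position is an integer multiple of step."""
    return [fp for pos, fp in enumerate(fingerprints, start=1) if pos % step == 0]


def sample_by_ratio(fingerprints: list, ratio: float = 0.05, seed=None) -> list:
    """Randomly keep a fixed share of the fingerprints (5% by default)."""
    rng = random.Random(seed)
    count = int(len(fingerprints) * ratio)
    return rng.sample(fingerprints, count)
```

Both functions return a subset of the input no larger than the input, which matches the basic sampling requirements stated above.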

Step 303: Determine, according to the fingerprint sampling table and the group sampling database, a similar group to which the file to be stored belongs in the group sampling database.

After the fingerprint sampling table of the file to be stored has been obtained, this step determines, according to the fingerprint sampling table and the current group sampling library saved in the storage system, the similar group to which the file to be stored belongs in the current group sampling library. The group sampling library is composed of at least one sample group, and each sample group contains the sampled fingerprints of one or more stored files that are similar to one another. The similar group is a sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored. In addition to the group sampling library, a fingerprint library is also provided in this embodiment, and the fingerprint library is composed of at least one fingerprint group. Each fingerprint group corresponds to a sample group in the group sampling library, and the fingerprints stored in a fingerprint group are the fingerprints of the stored files after deduplication. That is, each sample group of the group sampling library holds the sampled fingerprints of stored files that are similar to one another, and the corresponding fingerprint group of the fingerprint library holds all the fingerprints of those stored files after deduplication. If the file to be stored processed in this step is the first file, the group sampling library is empty at this time. This step may specifically be: matching each fingerprint in the fingerprint sampling table against the current group sampling library, and obtaining from the group sampling library the sample group that matches the sampled fingerprints of the file to be stored, that is, the similar group to which the file to be stored belongs.
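The relationship between the two libraries can be sketched as a pair of dictionaries that share group identifiers. The class name `DedupLibraries`, its fields, and the integer group ids are assumptions made for illustration; the text does not prescribe a storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class DedupLibraries:
    # group id -> sampled fingerprints of similar stored files
    sample_groups: dict = field(default_factory=dict)
    # group id -> all deduplicated block fingerprints of those files
    fingerprint_groups: dict = field(default_factory=dict)

    def new_group(self, sampled_fingerprints) -> int:
        """Create a sample group together with its corresponding fingerprint group."""
        group_id = len(self.sample_groups) + 1
        self.sample_groups[group_id] = set(sampled_fingerprints)
        self.fingerprint_groups[group_id] = set()
        return group_id
```

Creating both entries in one call keeps the one-to-one correspondence between sample groups and fingerprint groups described above.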

Step 304: Perform deduplication on the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library.

After the similar group to which the file to be stored belongs in the current group sampling library has been determined in the preceding steps, the corresponding fingerprint group in the fingerprint library is determined from the similar group, because the sample groups in the group sampling library correspond to the fingerprint groups in the fingerprint library. The file to be stored is then deduplicated according to the fingerprint data in that fingerprint group. The deletion method may be similar to the prior art: the calculated fingerprint of each block of the file to be stored is matched against the fingerprints stored in the fingerprint group corresponding to the similar group. If the fingerprint of a block is already saved in that fingerprint group, the data of the block is deleted; if the fingerprint group contains no fingerprint that is the same as or similar to that of a block, the data of the block is stored. This embodiment therefore narrows the fingerprint query range during deduplication from the entire fingerprint library to a single fingerprint group in the fingerprint library, which greatly reduces the computation required for query matching.

This embodiment provides a deduplication method that performs block processing on a file to be stored, calculates a fingerprint of each block, samples the block fingerprints, determines, according to the generated fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and deduplicates the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the block fingerprints are additionally sampled, the similar group is first determined by similarity analysis, and deduplication is then performed within the fingerprint group corresponding to the similar group, the amount of query computation is reduced. This solves the problem in the prior art that massive block data introduces a huge amount of computation and resource consumption during deduplication, reduces the deduplication computation, and improves deduplication performance.

FIG. 4 is a flowchart of Embodiment 2 of the method for deleting data in the present invention. As shown in FIG. 4, the embodiment provides a method for deleting data, which may specifically include the following steps:

Step 401: Perform block processing on the file to be stored, and calculate the fingerprint of each block in the block processing result. This step may be similar to step 301 above, and details are not described here again.

Step 402: Determine a sampling factor according to a file feature of the file to be stored.

After the file to be stored has been divided into blocks and the fingerprint of each block has been calculated, this embodiment samples the block fingerprints. This step determines the sampling factor used for the sampling. Specifically, the sampling factor is determined according to the file features of the file to be stored, where the file features may be the file size, the number of blocks, and the like; different files to be stored may therefore have different sampling factors. For example, when the number of blocks of the file to be stored is greater than 1 million, the sampling factor is determined to be 0xFFF; when the number of blocks is less than 1 million and greater than 100,000, the sampling factor is 0x3FF; when the number of blocks is less than 100,000 and greater than 10,000, the sampling factor is 0x2FF; and when the number of blocks is less than 10,000, the sampling factor is 0x3F.
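The example thresholds above map directly onto a small lookup function. The name `choose_sampling_factor` is hypothetical, and the handling of counts that fall exactly on a boundary (for example exactly 1 million blocks) is an assumption, because the example in the text leaves those cases open.

```python
def choose_sampling_factor(block_count: int) -> int:
    """Pick a sampling factor from the block count, using the example thresholds."""
    if block_count > 1_000_000:
        return 0xFFF
    if block_count > 100_000:
        return 0x3FF
    if block_count > 10_000:
        return 0x2FF
    # Boundary values fall through to the next lower tier (assumed behaviour).
    return 0x3F
```

A factor with more bits set keeps a smaller share of the fingerprints, so larger files are sampled more sparsely.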

Step 403: Sample the fingerprints of all the blocks of the file to be stored by using the sampling factor according to the set sampling condition.

The block fingerprints can be sampled with the sampling factor according to the set sampling condition. For example, the fingerprint of each block can be ANDed with the sampling factor and the result checked against 0; if the result is 0, the set sampling condition is met.

Step 404: Add the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.

The sampling in the preceding step yields a sampling result for each block, and the fingerprints of the blocks whose sampling results meet the sampling condition are added to the fingerprint sampling table of the file to be stored. For example, the fingerprints whose AND with the sampling factor is 0 may be added to the fingerprint sampling table, while the remaining fingerprints stay in the database in which the fingerprints were originally stored. This forms the fingerprint sampling table of the file to be stored.
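Steps 403 and 404 can be sketched as a filter over the block fingerprints. The function name `build_sampling_table` is hypothetical, and interpreting each fingerprint as a big-endian integer before the bitwise AND is an assumption about how the logical operation is applied.

```python
def build_sampling_table(fingerprints: list, factor: int) -> list:
    """Keep the fingerprints whose bitwise AND with the sampling factor is 0."""
    table = []
    for fp in fingerprints:
        # Treat the fingerprint bytes as one big-endian integer (assumed convention).
        if int.from_bytes(fp, "big") & factor == 0:
            table.append(fp)
    return table
```

With a factor of 0x3F, for instance, only fingerprints whose lowest six bits are all zero are kept. The returned table may be empty, which is the special case handled in steps 405 and 406.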

Step 405: Determine whether the fingerprint sampling table of the file to be stored is empty. If yes, execute step 406; otherwise, perform step 407.

After the fingerprint sampling table of the file to be stored has been obtained, the file to be stored is matched and assigned to a group according to the fingerprint sampling table and the current group sampling library. This step first determines whether the fingerprint sampling table of the file to be stored is empty, that is, whether the sampling produced no fingerprint that satisfies the sampling condition. If the table is empty, step 406 is performed; otherwise, step 407 is performed.

Step 406: Determine a similar group to which the file to be stored belongs in the group sampling database as a preset group in the group sampling database, and perform step 411.

When the fingerprint sampling table of the file to be stored is empty, none of the sampling results obtained by the sampling satisfies the sampling condition; that is, the file to be stored contains no block that satisfies the sampling condition. The similar group of the file to be stored in the current group sampling library is then determined to be the preset group in the current group sampling library. The similarity analysis of this embodiment ends here, and step 411 is performed next, in which the file to be stored is deduplicated within the fingerprint group that corresponds to the preset group in the fingerprint library. The preset group is a group defined in advance in this embodiment and has no specific meaning; it may be empty. It corresponds to a specific fingerprint group in the fingerprint library, and that fingerprint group stores the fingerprints of the files to be stored whose fingerprint sampling tables are empty. In actual sampling, the fingerprint sampling table can occasionally be empty after sampling; the handling of this special case is described here only so that it does not interrupt the overall process.

Step 407: Perform matching processing on each fingerprint in the fingerprint sampling table and the group sampling database.

When the fingerprint sampling table of the file to be stored is not empty, the sampling produced fingerprints that satisfy the sampling condition, and the fingerprints stored in the fingerprint sampling table are matched. Specifically, each fingerprint in the fingerprint sampling table is matched against the current group sampling library. The fingerprints in the group sampling library are stored in groups, and the fingerprints in each group are the sampled fingerprints of one or more files that have a certain similarity. In this step, taking each group in the current group sampling library as a unit, the fingerprints in the fingerprint sampling table are compared one by one with the fingerprints in that group, which yields a matching result for each group. The matching result is the similarity between the fingerprints in the fingerprint sampling table and the fingerprints in the corresponding group. For example, the similarity may be the ratio of the number of fingerprints in the fingerprint sampling table that are the same as or similar to fingerprints in the corresponding group to the total number of fingerprints in the fingerprint sampling table.
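The example similarity measure can be written as a ratio over sets. This sketch counts exact fingerprint matches only; the text also allows "similar" fingerprints to count, which is not modelled here. The function name `group_similarity` is hypothetical.

```python
def group_similarity(sampling_table: set, sample_group: set) -> float:
    """Share of the table's fingerprints that also appear in the sample group."""
    if not sampling_table:
        return 0.0  # an empty table is handled separately in steps 405 and 406
    matched = len(sampling_table & sample_group)
    return matched / len(sampling_table)
```

For a table of four fingerprints of which two appear in a group, the similarity to that group is 0.5.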

Step 408: Determine whether the capacity of the packet sampling library has reached the capacity upper limit. If yes, execute step 409; otherwise, perform step 410.

FIG. 5 shows the structure of the group sampling library in Embodiment 2 of the deduplication method of the present invention. This step determines whether the capacity of the current group sampling library has reached its upper limit, that is, whether the group sampling library is full. If so, step 409 is performed; otherwise, step 410 is performed.

Step 409: Determine that the similar group to which the file to be stored belongs in the group sampling library is the group in the group sampling library that has the highest similarity to the fingerprints in the fingerprint sampling table.

In this embodiment, if the capacity of the current group sampling library has reached its upper limit, the group in the current group sampling library with the highest similarity to the fingerprints in the fingerprint sampling table is obtained from the matching results. That group is determined to be the similar group to which the file to be stored belongs in the current group sampling library, and step 411 is performed.

Step 410: Determine, according to the fingerprint similarity between each fingerprint in the fingerprint sampling table and each fingerprint group in the group sampling database, a similar group to which the file to be stored belongs in the grouping sample database.

Each fingerprint in the fingerprint sampling table is matched against each group in the current group sampling library, and the similar group to which the file to be stored belongs in the current group sampling library is determined according to the fingerprint similarity between the fingerprints in the fingerprint sampling table and each group. Specifically, when the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sample group in the current group sampling library is greater than or equal to a preset similarity threshold, the file to be stored is considered to belong to that sample group; that sample group is directly determined to be the similar group of the file to be stored in the current group sampling library, and step 411 is performed. In this embodiment the groups are matched one by one, and as soon as the first sample group that satisfies the similarity condition is found, it is taken as the similar group selected by the similarity analysis and no further matching against the remaining sample groups is performed. This significantly reduces the computation of the similarity analysis algorithm and improves its performance. When the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the current group sampling library is less than the preset similarity threshold, the file to be stored does not belong to any group in the current group sampling library. A new group is then created in the current group sampling library, the newly created group is determined to be the similar group of the file to be stored, the fingerprints in the fingerprint sampling table of the file to be stored are saved in the newly created group, and step 411 is performed.

Further, in this embodiment, when the file to be stored is the first file to be stored, determining the similar group to which the file to be stored belongs in the current group sampling library according to the fingerprint sampling table and the current group sampling library is specifically: creating a new group in the current group sampling library, determining that the newly created group is the similar group of the file to be stored in the current group sampling library, and saving the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
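The decision flow of steps 405 to 410, including the first-file case, can be sketched as a single function. All names are hypothetical, as are the threshold value of 0.5, measuring library capacity as a number of groups, and reserving id 0 for the preset group; only exact fingerprint matches are counted.

```python
PRESET_GROUP = 0  # hypothetical id reserved for files with an empty sampling table


def similarity(table: set, group: set) -> float:
    """Share of the table's fingerprints that also appear in the group."""
    return len(table & group) / len(table)


def assign_group(table: set, groups: dict, threshold: float = 0.5,
                 max_groups: int = 1000) -> int:
    """Return the id of the similar group, creating a new group when needed."""
    if not table:                                   # steps 405 and 406
        return PRESET_GROUP
    if groups and len(groups) >= max_groups:        # steps 408 and 409: library full
        return max(groups, key=lambda gid: similarity(table, groups[gid]))
    for gid, sampled in groups.items():             # step 410: first match wins
        if similarity(table, sampled) >= threshold:
            return gid
    new_id = max(groups, default=PRESET_GROUP) + 1  # no match, or first file
    groups[new_id] = set(table)
    return new_id
```

Stopping at the first group that reaches the threshold is what keeps the matching cost low while the library still has room; the exhaustive search for the best group is only used once the library is full and no new group can be created.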

Step 411: Perform deduplication on the file to be stored according to the fingerprint data in the fingerprint group corresponding to the similar group in the fingerprint database.

After the similar group to which the file to be stored belongs in the current group sampling library has been determined in the preceding steps, the file to be stored is deduplicated according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. The deletion method may be similar to the prior art: the calculated fingerprint of each block of the file to be stored is matched against the fingerprints saved in the fingerprint group corresponding to the similar group. If the fingerprint of a block is already saved in that fingerprint group, the data of the block is deleted; if the fingerprint group contains no fingerprint that is the same as or similar to that of a block, the data of the block is stored. In this embodiment, the fingerprint query range during deduplication is therefore reduced from the entire fingerprint library to a single fingerprint group in the fingerprint library, which greatly reduces the computation required for query matching.
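Deduplication within the selected fingerprint group can be sketched as follows. The function name `dedup_store`, the dictionary used as block storage, and the returned list of fingerprints (a "recipe" for rebuilding the file) are assumptions added for illustration; a single SHA-1 hash stands in for the block fingerprint, and only exact matches are treated as duplicates.

```python
import hashlib


def dedup_store(blocks: list, fingerprint_group: set, storage: dict) -> list:
    """Store only blocks whose fingerprint is not yet in the group's fingerprint set."""
    recipe = []
    for block in blocks:
        fp = hashlib.sha1(block).digest()
        if fp not in fingerprint_group:
            # New content for this group: record the fingerprint and keep the data.
            fingerprint_group.add(fp)
            storage[fp] = block
        # Duplicate blocks are dropped; only their fingerprint is referenced.
        recipe.append(fp)
    return recipe
```

Because the lookup touches one group's fingerprint set rather than a global index, the amount of fingerprint data consulted per file is bounded by the size of that group.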

This embodiment provides a deduplication method that performs block processing on a file to be stored, calculates a fingerprint of each block, samples the fingerprints by using a sampling factor, determines, according to the generated fingerprint sampling table and the current group sampling library, the similar group to which the file to be stored belongs in the current group sampling library, and deduplicates the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the block fingerprints are additionally sampled, the similar group is first determined by similarity analysis, and deduplication is then performed within the fingerprint group corresponding to the similar group, the computation of the deduplication query is reduced. This solves the problem in the prior art that massive block data introduces a huge amount of computation and resource consumption during deduplication, reduces the deduplication computation, and improves deduplication performance.

A person skilled in the art can understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

FIG. 6 is a structural diagram of Embodiment 1 of the deduplication apparatus of the present invention. As shown in FIG. 6, this embodiment provides a deduplication apparatus that can perform the steps of Embodiment 1 of the foregoing method; details are not repeated here. The deduplication apparatus provided in this embodiment may specifically include a blocking module 601, a sampling module 602, a grouping module 603, and a deduplication module 604. The blocking module 601 is configured to perform block processing on the file to be stored and calculate a fingerprint of each block in the block processing result. The sampling module 602 is configured to sample the fingerprints of the blocks and generate a fingerprint sampling table of the file to be stored according to the extracted fingerprints. The grouping module 603 is configured to determine, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library. The deduplication module 604 is configured to deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. The group sampling library is composed of at least one sample group, the fingerprint library is composed of at least one fingerprint group, and each sample group in the group sampling library corresponds to a fingerprint group in the fingerprint library. The similar group is a sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.

FIG. 7 is a structural diagram of Embodiment 2 of the data deduplication apparatus of the present invention. As shown in FIG. 7, this embodiment provides a deduplication apparatus that may perform the steps of Embodiment 2 of the foregoing method; details are not repeated here. The deduplication apparatus provided in this embodiment is based on the apparatus shown in FIG. 6. The sampling module 602 may include a determining unit 612, a sampling unit 622, and a generating unit 632. The determining unit 612 is configured to determine a sampling factor according to file features of the file to be stored, where the file features include the file size and the number of blocks of the file to be stored. The sampling unit 622 is configured to sample the fingerprints of all blocks of the file to be stored by using the sampling factor according to a set sampling condition. The generating unit 632 is configured to add the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored. The grouping module 603 in this embodiment may include a first grouping unit 613, a matching unit 623, and a second grouping unit 633. The first grouping unit 613 is configured to determine, when the fingerprint sampling table is empty, that the similar group to which the file to be stored belongs in the group sampling library is a preset group in the group sampling library. The matching unit 623 is configured to match, when the fingerprint sampling table is not empty, each fingerprint in the fingerprint sampling table against the group sampling library. The second grouping unit 633 is configured to determine, according to the matching result, the similar group to which the file to be stored belongs in the group sampling library.
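The cooperation of the determining unit, the sampling unit, and the generating unit can be sketched as follows. This is a hedged illustration: the size and block-count thresholds in `sampling_factor`, and the sampling condition "fingerprint value divisible by the sampling factor", are assumptions, since this section does not fix either. Fingerprints are assumed to be hexadecimal strings.

```python
def sampling_factor(file_size: int, block_count: int) -> int:
    """Hypothetical rule: small files are sampled densely, large files sparsely."""
    if file_size < (1 << 20) or block_count < 64:
        return 2
    if block_count < 4096:
        return 16
    return 64


def build_sampling_table(fingerprints, factor: int):
    """Keep the fingerprints that satisfy the assumed sampling condition.

    A fingerprint is sampled when its integer value is divisible by the factor,
    so roughly one fingerprint in `factor` is kept for uniformly distributed hashes.
    """
    return [fp for fp in fingerprints if int(fp, 16) % factor == 0]
```

With this condition the sampling table can come out empty for a file with few blocks, which is the case the first grouping unit handles by assigning the preset group.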

Further, the second grouping unit 633 in this embodiment may include a first grouping subunit 6331, a second grouping subunit 6332, and a third grouping subunit 6333. The first grouping subunit 6331 is configured to: if the capacity of the group sampling library has reached its upper limit, determine that the similar group to which the file to be stored belongs in the group sampling library is the group in the group sampling library that has the highest similarity to the fingerprints in the fingerprint sampling table. The second grouping subunit 6332 is configured to: if the capacity of the group sampling library has not reached its upper limit, and the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sample group in the group sampling library is greater than or equal to a preset similarity threshold, determine that the similar group to which the file to be stored belongs in the group sampling library is that sample group. The third grouping subunit 6333 is configured to: if the capacity of the group sampling library has not reached its upper limit, and the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the group sampling library is less than the preset similarity threshold, establish a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
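The three cases handled by the grouping subunits, together with the empty-table case of the first grouping unit, can be sketched in Python as below. The sketch rests on several assumptions that this section does not specify: similarity is measured as the fraction of the file's sampled fingerprints found in a sample group, library capacity is counted in groups (`max_groups`), and when several groups meet the threshold the most similar one is chosen. All names are hypothetical.

```python
def similarity(sample_table, sample_group) -> float:
    """Assumed measure: share of the file's sampled fingerprints present in the group."""
    if not sample_table:
        return 0.0
    return sum(fp in sample_group for fp in sample_table) / len(sample_table)


def choose_group(sample_table, library, max_groups, threshold, preset="preset"):
    """Return the id of the similar group; library maps group id -> set of fingerprints."""
    if not sample_table:
        # Empty fingerprint sampling table: use the preset group
        library.setdefault(preset, set())
        return preset

    # Match the sampling table against every sample group and track the best match
    best_id, best_sim = None, -1.0
    for group_id, group in library.items():
        sim = similarity(sample_table, group)
        if sim > best_sim:
            best_id, best_sim = group_id, sim

    library_full = len(library) >= max_groups
    if best_id is not None and (library_full or best_sim >= threshold):
        # Library full: most similar group; otherwise: a group meeting the threshold
        return best_id

    # Library not full and no group is similar enough (or library is still empty,
    # as for the first file to be stored): create a new group from the samples
    new_id = f"group-{len(library)}"
    library[new_id] = set(sample_table)
    return new_id
```

An empty library takes the last branch, so the first file to be stored always founds a new group without needing a separate code path.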

Specifically, when the file to be stored is the first file to be stored, the grouping module 603 in this embodiment may be configured to establish a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.

This embodiment provides a data deduplication apparatus that performs block processing on the file to be stored, calculates a fingerprint of each block, and samples the fingerprints; determines, according to the generated fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library; and deduplicates the file to be stored according to the fingerprint data in the fingerprint group corresponding to the similar group in the fingerprint library. By further sampling the block fingerprints, this embodiment first determines the similar group through similarity analysis and then performs deduplication only within the fingerprint group corresponding to the similar group. This reduces the amount of computation spent on deduplication queries, solves the prior-art problem that massive block data introduces a large amount of computation and resource consumption during deduplication, and improves deduplication performance.

It should be noted that the above embodiments are only intended to explain the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A deduplication method, characterized by comprising:

performing block processing on a file to be stored, and calculating a fingerprint of each block in the block processing result;

sampling the fingerprints of the blocks, and generating a fingerprint sampling table of the file to be stored according to the sampled fingerprints;

determining, according to the fingerprint sampling table and a group sampling library, a similar group to which the file to be stored belongs in the group sampling library; and

deduplicating the file to be stored according to fingerprint data in the fingerprint group corresponding to the similar group in a fingerprint library;

wherein the group sampling library consists of at least one sample group, the fingerprint library consists of at least one fingerprint group, each fingerprint group in the fingerprint library stores all fingerprints of data files after deduplication, the sample groups in the group sampling library correspond one-to-one to the fingerprint groups in the fingerprint library, each sample group includes sampled fingerprints of one or more stored files having similarity, and the similar group is a sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.

2. The method according to claim 1, wherein the sampling the fingerprints of the blocks, and generating a fingerprint sampling table of the file to be stored according to the sampled fingerprints comprises:

determining a sampling factor according to file features of the file to be stored, wherein the file features include the file size and the number of blocks of the file to be stored;

sampling, by using the sampling factor, the fingerprints of all blocks of the file to be stored according to a set sampling condition; and

adding the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.

3. The method according to claim 1 or 2, wherein the determining, according to the fingerprint sampling table and the group sampling library, a similar group to which the file to be stored belongs in the group sampling library comprises:

when the fingerprint sampling table is empty, determining that the similar group to which the file to be stored belongs in the group sampling library is a preset group in the group sampling library; and

when the fingerprint sampling table is not empty, matching each fingerprint in the fingerprint sampling table against the group sampling library, and determining, according to the matching result, the similar group to which the file to be stored belongs in the group sampling library.

4. The method according to claim 3, wherein the determining, according to the matching result, the similar group to which the file to be stored belongs in the group sampling library comprises:

if the capacity of the group sampling library has reached its upper limit, determining that the similar group to which the file to be stored belongs in the group sampling library is the group in the group sampling library that has the highest similarity to the fingerprints in the fingerprint sampling table;

if the capacity of the group sampling library has not reached its upper limit, and the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sample group in the group sampling library is greater than or equal to a preset similarity threshold, determining that the similar group to which the file to be stored belongs in the group sampling library is that sample group; and

if the capacity of the group sampling library has not reached its upper limit, and the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the group sampling library is less than the preset similarity threshold, establishing a new group in the group sampling library, determining that the similar group to which the file to be stored belongs in the group sampling library is the new group, and saving the fingerprints in the fingerprint sampling table of the file to be stored to the new group.

5. The method according to claim 1, wherein when the file to be stored is the first file to be stored, the determining, according to the fingerprint sampling table and the group sampling library, a similar group to which the file to be stored belongs in the group sampling library comprises:

establishing a new group in the group sampling library, determining that the similar group to which the file to be stored belongs in the group sampling library is the new group, and saving the fingerprints in the fingerprint sampling table of the file to be stored to the new group.

6. A data deduplication device, comprising:

a blocking module, configured to perform block processing on a file to be stored, and calculate a fingerprint of each block in the block processing result;

a sampling module, configured to sample the fingerprints of the blocks, and generate a fingerprint sampling table of the file to be stored according to the sampled fingerprints;

a grouping module, configured to determine, according to the fingerprint sampling table and a group sampling library, a similar group to which the file to be stored belongs in the group sampling library; and

a deduplication module, configured to deduplicate the file to be stored according to fingerprint data in the fingerprint group corresponding to the similar group in a fingerprint library;

wherein the group sampling library consists of at least one sample group, the fingerprint library consists of at least one fingerprint group, each fingerprint group in the fingerprint library stores all fingerprints of data files after deduplication, the sample groups in the group sampling library correspond one-to-one to the fingerprint groups in the fingerprint library, each sample group includes sampled fingerprints of one or more stored files having similarity, and the similar group is a sample group in the group sampling library that matches the sampled fingerprints in the fingerprint sampling table of the file to be stored.

7. The device according to claim 6, wherein the sampling module comprises:

a determining unit, configured to determine a sampling factor according to file features of the file to be stored, wherein the file features include the file size and the number of blocks of the file to be stored;

a sampling unit, configured to sample, by using the sampling factor, the fingerprints of all blocks of the file to be stored according to a set sampling condition; and

a generating unit, configured to add the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.

8. The device according to claim 6 or 7, wherein the grouping module comprises:

a first grouping unit, configured to determine, when the fingerprint sampling table is empty, that the similar group to which the file to be stored belongs in the group sampling library is a preset group in the group sampling library;

a matching unit, configured to match, when the fingerprint sampling table is not empty, each fingerprint in the fingerprint sampling table against the group sampling library; and

a second grouping unit, configured to determine, according to a matching result of the matching unit, the similar group to which the file to be stored belongs in the group sampling library.

9. The device according to claim 8, wherein the second grouping unit comprises:

a first grouping subunit, configured to: if the capacity of the group sampling library has reached its upper limit, determine that the similar group to which the file to be stored belongs in the group sampling library is the group in the group sampling library that has the highest similarity to the fingerprints in the fingerprint sampling table;

a second grouping subunit, configured to: if the capacity of the group sampling library has not reached its upper limit, and the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sample group in the group sampling library is greater than or equal to a preset similarity threshold, determine that the similar group to which the file to be stored belongs in the group sampling library is that sample group; and

a third grouping subunit, configured to: if the capacity of the group sampling library has not reached its upper limit, and the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the group sampling library is less than the preset similarity threshold, establish a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.

10. The device according to claim 6, wherein when the file to be stored is the first file to be stored, the grouping module is specifically configured to establish a new group in the group sampling library, determine that the similar group to which the file to be stored belongs in the group sampling library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.