WO2013075668A1 - Method and device for data deduplication - Google Patents

Method and device for data deduplication

Info

Publication number
WO2013075668A1
Authority
WO
WIPO (PCT)
Prior art keywords
fingerprint
sampling
file
group
stored
Prior art date
Application number
PCT/CN2012/085278
Other languages
English (en)
Chinese (zh)
Inventor
付旭东
徐君
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2013075668A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Definitions

  • Deduplication is a data reduction technique commonly used in disk-based backup systems to reduce the storage capacity used in storage systems.
  • Deduplication technology is aimed at scenarios with large data volumes.
  • The industry's deduplication technology mainly includes block detection, similarity detection, and delta coding. Of these, similarity detection and delta coding are two other data compression methods, but they only detect similar files, and their deduplication rate is low.
  • The deduplication rate is an important indicator of the effectiveness of deduplication; it indicates how much of the duplicate data is removed.
  • FIG. 1 is a schematic diagram of a process of the data deduplication technology in the prior art.
  • In the prior art, a huge block data index table is established in memory to maintain the index of the block data.
  • The data object is divided into blocks, the fingerprint of each block in the blocking result is calculated, and the fingerprint of each block is stored in the block data index table (i.e., the fingerprint database). FIG. 2 is a schematic diagram of the structure of the fingerprint database of the prior-art deduplication technology.
  • When data is stored, the block data index table is queried first. Blocks whose fingerprints are already in the table are not stored; only new blocks, whose fingerprints are not found in the block data index table, are stored. This avoids storing blocks with duplicate content, which is equivalent to deleting the blocks whose content is repeated.
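  • The prior-art flow described above can be sketched roughly as follows. This is only an illustration: the names `store_with_global_index`, `index`, and `storage` are hypothetical, and SHA-1 stands in for whatever fingerprint function a real system uses. Every block is looked up in a single global index, which is the cost that the grouping approach of this document aims to reduce.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    # SHA-1 is used here only as an example fingerprint function.
    return hashlib.sha1(block).hexdigest()

def store_with_global_index(blocks, index, storage):
    """Store only blocks whose fingerprint is not yet in the global index.

    index   -- dict mapping fingerprint -> position in storage (hypothetical)
    storage -- list of stored block contents (hypothetical)
    Returns the list of fingerprints that describes the stored object.
    """
    recipe = []
    for block in blocks:
        fp = fingerprint(block)
        if fp not in index:          # new content: store it and index it
            index[fp] = len(storage)
            storage.append(block)
        recipe.append(fp)            # duplicates are referenced, not stored
    return recipe

index, storage = {}, []
recipe = store_with_global_index([b"aaaa", b"bbbb", b"aaaa"], index, storage)
```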
  • The prior art has at least the following drawback: when a large amount of data is stored, a large amount of block data is generated during deduplication, and comparing each block fingerprint one by one against the large number of fingerprints in the fingerprint database requires a large amount of computation and memory, which makes the deduplication process inefficient.
  • The present invention provides a deduplication method and apparatus that address the problem in the prior art that deduplication requires a large amount of computation and resources, which results in low deduplication performance.
  • The present invention provides a method for deduplication, including:
  • The group sampling library is composed of at least one sampling group.
  • The fingerprint library is composed of at least one fingerprint group.
  • Each sampling group in the group sampling library corresponds to a fingerprint group in the fingerprint library.
  • The similar group is the sampling group in the group sampling library that matches the sampling fingerprints in the fingerprint sampling table of the file to be stored.
  • The present invention provides a data deduplication apparatus, including:
  • a blocking module, configured to divide the file to be stored into blocks and calculate a fingerprint of each block in the blocking result;
  • a sampling module, configured to sample the fingerprints of the blocks and generate a fingerprint sampling table of the file to be stored from the extracted fingerprints;
  • a grouping module, configured to determine, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library;
  • a deduplication module, configured to deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library.
  • The group sampling library is composed of at least one sampling group, the fingerprint library is composed of at least one fingerprint group, and each sampling group in the group sampling library corresponds to a fingerprint group in the fingerprint library. The similar group is the sampling group in the group sampling library that matches the sampling fingerprints in the fingerprint sampling table of the file to be stored.
  • The deduplication method and device of the present invention divide a file to be stored into blocks, calculate a fingerprint of each block, sample the fingerprints of the blocks, determine, according to the generated fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the fingerprints of the blocks are further sampled, the similar group is first determined by similarity analysis, and deduplication is then performed only within the fingerprint group corresponding to that similar group, the amount of query computation for deduplication is reduced. This addresses the huge computation and resource consumption that massive block data introduces into deduplication in the prior art, and improves deduplication performance.
  • FIG. 1 is a schematic diagram of a process of a data deduplication technology in the prior art.
  • FIG. 2 is a schematic structural diagram of a fingerprint library of a data deduplication technology in the prior art.
  • FIG. 3 is a flowchart of Embodiment 1 of the data deduplication method of the present invention.
  • FIG. 4 is a flowchart of Embodiment 2 of the data deduplication method of the present invention.
  • FIG. 5 is a schematic structural diagram of the group sampling library in Embodiment 2 of the data deduplication method of the present invention.
  • FIG. 6 is a structural diagram of Embodiment 1 of the data deduplication apparatus of the present invention.
  • FIG. 7 is a structural diagram of Embodiment 2 of the data deduplication apparatus of the present invention.
  • The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments.
  • The described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of the present invention.
  • FIG. 3 is a flowchart of Embodiment 1 of the data deduplication method of the present invention. As shown in FIG. 3, the embodiment provides a data deduplication method, which may specifically include the following steps:
  • Step 301: Divide the file to be stored into blocks, and calculate the fingerprint of each block in the blocking result.
  • The same deduplication method is applied when each file is stored; before it is stored, a file is referred to as the file to be stored.
  • In this step, the file to be stored is divided into blocks. The blocking can use a prior-art blocking technique; for example, a variable-length blocking algorithm can be used to divide the file to be stored.
  • The fingerprint calculation can likewise use a prior-art method; for example, a double hash that combines SHA-1 and MD5 can be used to calculate the fingerprint of each block.
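  • A minimal sketch of step 301 follows, under stated assumptions. The text names neither a specific variable-length blocking algorithm nor how SHA-1 and MD5 are combined, so the shift-and-add boundary hash, the `mask`, `min_size`, and `max_size` parameters, and the concatenation of the two digests are all illustrative choices rather than the method of the embodiment.

```python
import hashlib

def variable_length_blocks(data: bytes, mask: int = 0x3F,
                           min_size: int = 16, max_size: int = 256):
    """Split data at content-defined boundaries (illustrative parameters)."""
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shift-and-add hash; the influence of old bytes fades after 32 shifts.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            blocks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        blocks.append(data[start:])   # trailing partial block
    return blocks

def block_fingerprint(block: bytes) -> str:
    """Double-hash fingerprint: SHA-1 hex digest followed by MD5 hex digest."""
    return hashlib.sha1(block).hexdigest() + hashlib.md5(block).hexdigest()

data = bytes(range(256)) * 8
blocks = variable_length_blocks(data)
fingerprints = [block_fingerprint(b) for b in blocks]
```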
  • Step 302: Sample the fingerprints of the blocks, and generate a fingerprint sampling table of the file to be stored from the extracted fingerprints.
  • In this step, the fingerprints are sampled. The basic requirements of the sampling are that every fingerprint in the sampling result falls within the set of block fingerprints of the file to be deduplicated, and that the number of fingerprints in the sampling result does not exceed the number of block fingerprints of that file.
  • A sampling factor can be used to sample the fingerprints of the blocks. The sampling factor is a sampling feature value that is combined with each block fingerprint of the file in a logical operation; sampling selects a subset of the fingerprints according to the sampling rule and the results of that logical operation. Different sampling factors may be selected for different files to be stored, so that the sampling factor reflects the characteristics of the file to be stored. The fingerprints selected in this way form the fingerprint sampling table of the file to be stored. The sampling factor may be determined from the file size and the number of blocks of the file to be stored, or a predetermined fixed sampling factor may be used. The fingerprints in the fingerprint sampling table are those retained by the selection and can represent the characteristics of the file to be stored, which reduces the amount of fingerprint data that must be stored subsequently.
  • The sampling in this step may also be independent of a sampling factor. For example, the last byte of each block fingerprint may be taken directly as the extracted fingerprint; or the fingerprints of blocks at fixed positions may be extracted, for example the blocks whose position is an integer multiple of 9; or fingerprints may be extracted according to a predetermined sampling ratio, for example by randomly extracting the fingerprints of 5% of the blocks.
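  • Two of the factor-independent sampling options mentioned above might look like the sketch below. The function names are hypothetical, counting positions from 1 is an assumption, and so is keeping at least one fingerprint when the ratio would round down to zero.

```python
import random

def sample_fixed_positions(fingerprints, step=9):
    """Keep the fingerprints of blocks whose position is a multiple of step.
    Positions are counted from 1 here, which is an assumption."""
    return [fp for pos, fp in enumerate(fingerprints, start=1) if pos % step == 0]

def sample_by_ratio(fingerprints, ratio=0.05, seed=None):
    """Randomly keep about `ratio` of the fingerprints (at least one if any exist)."""
    if not fingerprints:
        return []
    count = max(1, round(len(fingerprints) * ratio))
    return random.Random(seed).sample(fingerprints, count)

fps = [f"fp{i:03d}" for i in range(1, 101)]
fixed = sample_fixed_positions(fps)
by_ratio = sample_by_ratio(fps, seed=0)
```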
  • Step 303: Determine, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library.
  • This step determines, based on the fingerprint sampling table and the current group sampling library saved in the storage system, the similar group to which the file to be stored belongs in the current group sampling library.
  • The group sampling library is composed of at least one sampling group, and each sampling group contains the sampling fingerprints of one or more stored files that are similar to one another.
  • The similar group is the sampling group in the group sampling library that matches the sampling fingerprints in the fingerprint sampling table of the file to be stored.
  • A fingerprint library is also provided in this embodiment; it is composed of at least one fingerprint group.
  • Each fingerprint group corresponds to a sampling group in the group sampling library, and the fingerprints stored in a fingerprint group are the fingerprints of the stored files after deduplication. That is, each sampling group of the group sampling library holds the sampling fingerprints of stored files that are similar to one another, and each fingerprint group of the fingerprint library holds all the fingerprints of those stored files after deduplication. If the file to be stored in this step is the first file, the group sampling library is empty at this point.
  • Specifically, each fingerprint in the fingerprint sampling table may be matched against the current group sampling library; the sampling group that matches the sampling fingerprints of the file to be stored is obtained from the group sampling library by this matching, and it is the similar group to which the file to be stored belongs.
  • Step 304: Deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library.
  • After the similar group of the file to be stored is determined, the fingerprint group corresponding to it in the fingerprint library is determined from the similar group, and the file to be stored is deduplicated according to the fingerprint data in that fingerprint group.
  • The specific deduplication method may be similar to the prior art: the calculated fingerprint of each block of the file to be stored is matched against the fingerprints stored in the fingerprint group corresponding to the determined similar group.
  • This embodiment narrows the range of fingerprints queried and matched during deduplication from the entire fingerprint library to a single fingerprint group in the fingerprint library, which greatly reduces the amount of query-matching computation.
  • This embodiment provides a data deduplication method that divides a file to be stored into blocks, calculates a fingerprint of each block, samples the fingerprints of the blocks, determines, according to the generated fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library, and deduplicates the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the fingerprints of the blocks are further sampled, the similar group is first determined by similarity analysis, and deduplication is then performed only within the fingerprint group corresponding to the similar group, the amount of query computation for deduplication is reduced. This addresses the huge computation and resource consumption that massive block data introduces into deduplication in the prior art, and improves deduplication performance.
  • FIG. 4 is a flowchart of Embodiment 2 of the data deduplication method of the present invention. As shown in FIG. 4, the embodiment provides a data deduplication method, which may specifically include the following steps:
  • Step 401: Divide the file to be stored into blocks, and calculate the fingerprint of each block in the blocking result. This step is similar to step 301 above, and the details are not repeated here.
  • Step 402: Determine a sampling factor according to the file features of the file to be stored.
  • This step determines the sampling factor used for sampling. Specifically, the sampling factor is determined according to the file features of the file to be stored, which may be the file size, the number of blocks, and the like; different files to be stored may therefore have different sampling factors. For example, when the number of blocks of the file to be stored is greater than 1 million, the sampling factor is determined to be 0xFFF; when the number of blocks is less than 1 million and greater than 100,000, the sampling factor is determined to be 0x3FF; when the number of blocks is less than 100,000 and greater than 10,000, the sampling factor is determined to be 0x2FF; and when the number of blocks is less than 10,000, the sampling factor is determined to be 0x3F.
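  • The example tiers of step 402 can be written as a small lookup. The function name is hypothetical, and because the text does not say which tier applies when the block count is exactly 1,000,000, 100,000, or 10,000, this sketch assigns those boundary values to the lower tier as an assumption.

```python
def select_sampling_factor(block_count: int) -> int:
    """Map a block count to the sampling factor from the example tiers.

    Boundary values (exactly 1,000,000 / 100,000 / 10,000) fall into the
    lower tier here; the source text leaves them unspecified.
    """
    if block_count > 1_000_000:
        return 0xFFF
    if block_count > 100_000:
        return 0x3FF
    if block_count > 10_000:
        return 0x2FF
    return 0x3F
```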
  • Step 403: Sample the fingerprints of all blocks of the file to be stored using the sampling factor, according to the set sampling condition.
  • After the sampling factor is determined, sampling is performed with it according to the set sampling condition. For example, the fingerprint of each block can be combined with the sampling factor in an AND operation, and the result is checked: if the result is 0, the set sampling condition is met.
  • Step 404: Add the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.
  • After all blocks have been sampled, a sampling result is obtained for each block, and the fingerprints of the blocks whose sampling results meet the sampling condition are added to the fingerprint sampling table of the file to be stored.
  • For example, the fingerprints of the blocks whose AND result with the sampling factor is 0 may be added to the fingerprint sampling table, while the remaining fingerprints stay in the database in which the fingerprints were originally stored; this forms the fingerprint sampling table of the file to be stored.
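  • Steps 403 and 404 can be sketched as a bitwise AND filter. This assumes that fingerprints are hexadecimal digest strings and that the operation with the sampling factor is a bitwise AND; `build_fingerprint_sampling_table` is a hypothetical name.

```python
import hashlib

def build_fingerprint_sampling_table(fingerprints, sampling_factor):
    """Keep fingerprints whose bitwise AND with the sampling factor is 0."""
    table = []
    for fp in fingerprints:
        value = int(fp, 16)                    # hex digest -> integer
        if (value & sampling_factor) == 0:     # the set sampling condition
            table.append(fp)
    return table

fingerprints = [hashlib.sha1(bytes([i])).hexdigest() for i in range(256)]
table = build_fingerprint_sampling_table(fingerprints, 0x3)
```

  • With a factor such as 0x3F, the six low-order bits of a fingerprint must all be zero, so roughly 1 in 64 uniformly distributed fingerprints would be kept.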
  • Step 405: Determine whether the fingerprint sampling table of the file to be stored is empty. If yes, perform step 406; otherwise, perform step 407.
  • After the fingerprint sampling table is generated, the file to be stored is matched and assigned to a group according to the fingerprint sampling table and the current group sampling library. This step determines whether the fingerprint sampling table of the file to be stored is empty, that is, whether sampling yielded any fingerprint that satisfies the sampling condition. If the table is empty, step 406 is performed; otherwise, step 407 is performed.
  • Step 406: Determine that the similar group to which the file to be stored belongs in the group sampling library is the preset group in the group sampling library, and perform step 411.
  • When the fingerprint sampling table of the file to be stored is empty, none of the sampling results satisfies the sampling condition, that is, the file to be stored contains no block that satisfies the sampling condition. The similar group to which the file to be stored belongs in the current group sampling library is then determined to be the preset group in the current group sampling library.
  • The similarity analysis of this embodiment then ends, and step 411 is performed: the file to be stored is deduplicated within the fingerprint group that corresponds to the preset group in the fingerprint library.
  • The preset group is a group set in advance in this embodiment and has no specific meaning. The preset group may be empty; it corresponds to a specific fingerprint group in the fingerprint library, and that fingerprint group holds the fingerprints of the files to be stored whose fingerprint sampling tables are empty.
  • An empty fingerprint sampling table after sampling is a special case; its handling is explained here only so that the whole process is not interrupted when it occurs.
  • Step 407: Match each fingerprint in the fingerprint sampling table against the group sampling library.
  • When the fingerprint sampling table of the file to be stored is not empty, sampling yielded fingerprints that satisfy the sampling condition, and these fingerprints, stored in the fingerprint sampling table, are matched: each fingerprint in the fingerprint sampling table is matched against the current group sampling library.
  • The fingerprints in the group sampling library are stored in groups, and the fingerprints in each group are the sampling fingerprints of one or more files that have a certain similarity.
  • During matching, the fingerprints in the fingerprint sampling table are compared one by one with the fingerprints in each group, taking each group in the current group sampling library as a unit. This yields a matching result for each group, namely the similarity between the fingerprint sampling table and that group.
  • The similarity may be the ratio of the number of fingerprints in the fingerprint sampling table that are the same as or similar to fingerprints in the corresponding group to the total number of fingerprints in the fingerprint sampling table.
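  • The similarity ratio described above can be illustrated as follows. This sketch counts only identical fingerprints (the text also allows "similar" fingerprints, which is not modelled here), and the function name is hypothetical.

```python
def similarity(sampling_table, sampling_group):
    """Share of the table's fingerprints that also occur in the group."""
    if not sampling_table:
        return 0.0
    group = set(sampling_group)
    matches = sum(1 for fp in sampling_table if fp in group)
    return matches / len(sampling_table)
```

  • For instance, a table of four fingerprints of which two also appear in a group has a similarity of 0.5 to that group.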
  • Step 408: Determine whether the capacity of the group sampling library has reached the capacity upper limit. If yes, perform step 409; otherwise, perform step 410.
  • This step determines whether the capacity of the current group sampling library has reached its upper limit.
  • FIG. 5 is a schematic structural diagram of the group sampling library in Embodiment 2 of the deduplication method of the present invention. The determination in this step is whether the group sampling library is full. If yes, step 409 is performed; otherwise, step 410 is performed.
  • Step 409: Determine that the similar group to which the file to be stored belongs in the group sampling library is the group that has the highest similarity with the fingerprints in the fingerprint sampling table.
  • When the capacity of the current group sampling library has reached its upper limit, the group in the current group sampling library that has the highest similarity with the fingerprints in the fingerprint sampling table is obtained from the matching results.
  • That group is determined to be the similar group to which the file to be stored belongs in the current group sampling library, and step 411 is performed.
  • Step 410: Determine, according to the fingerprint similarity between the fingerprints in the fingerprint sampling table and each sampling group in the group sampling library, the similar group to which the file to be stored belongs in the group sampling library.
  • When the capacity of the current group sampling library has not reached its upper limit, the similar group to which the file to be stored belongs is determined according to the fingerprint similarity between the fingerprints in the fingerprint sampling table and each sampling group in the current group sampling library. Specifically, when the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sampling group in the current group sampling library is greater than or equal to a preset similarity threshold, the file to be stored is considered to belong to that sampling group; the sampling group is directly determined to be the similar group to which the file to be stored belongs, and step 411 is performed.
  • During fingerprint matching, when the first sampling group that satisfies the above similarity condition is found, that sampling group is used as the similar group selected by the similarity analysis, and matching against the remaining sampling groups is not performed.
  • This reduces the computational complexity of the similarity analysis algorithm and improves its performance.
  • When the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the current group sampling library is less than the preset similarity threshold, the file to be stored does not belong to any group in the current group sampling library.
  • In this case, a new group is created in the current group sampling library, the similar group to which the file to be stored belongs in the current group sampling library is determined to be the newly created group, the fingerprints in the fingerprint sampling table of the file to be stored are saved in the newly created group, and step 411 is performed.
  • Because the first group that meets the similarity threshold is selected and the subsequent groups need not be matched, this embodiment significantly reduces the amount of calculation of the similarity analysis algorithm.
  • In addition, when the file to be stored is the first file to be stored, determining the similar group to which the file to be stored belongs according to the fingerprint sampling table and the current group sampling library is specifically: creating a new group in the current group sampling library, determining that the similar group to which the file to be stored belongs in the current group sampling library is the newly created group, and saving the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
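  • The decision logic of steps 405 to 410 can be summarized in a sketch like the one below. Several details are assumptions made for illustration: the group sampling library is modelled as a dict, its capacity is measured as a number of groups, group IDs such as "group-0" and the `PRESET_GROUP` constant are hypothetical, and only identical fingerprints count toward similarity.

```python
PRESET_GROUP = "preset"   # hypothetical ID of the preset group

def similarity(table, group):
    """Share of the table's fingerprints found in the group (table non-empty)."""
    group = set(group)
    return sum(fp in group for fp in table) / len(table)

def assign_group(table, library, capacity, threshold):
    """Return the ID of the similar group for a fingerprint sampling table.

    library   -- dict: group ID -> set of sampling fingerprints
    capacity  -- maximum number of groups (an assumed capacity measure)
    threshold -- preset similarity threshold in [0, 1]
    """
    if not table:                               # steps 405/406: empty table
        return PRESET_GROUP
    if library and len(library) >= capacity:    # steps 408/409: library full
        return max(library, key=lambda g: similarity(table, library[g]))
    for group_id, group in library.items():     # step 410: first match wins
        if similarity(table, group) >= threshold:
            return group_id
    new_id = f"group-{len(library)}"            # no match: create a new group
    library[new_id] = set(table)
    return new_id
```

  • The first file to be stored falls through to the last branch, because the library is empty and no group can match, so a new group is created for it.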
  • Step 411: Deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library.
  • After the similar group of the file to be stored has been determined, the file to be stored is deduplicated within the fingerprint group that corresponds to the similar group in the fingerprint library.
  • The specific deduplication method may be similar to the prior art: the calculated fingerprint of each block of the file to be stored is matched against the fingerprints saved in the fingerprint group corresponding to the similar group. If the fingerprint of a block has already been saved in the fingerprint group corresponding to the similar group, the data of that block is deleted; if the fingerprint group corresponding to the similar group contains no fingerprint that is the same as or similar to that of a block, the data of that block is stored. In this embodiment, the range of fingerprints queried and matched during deduplication is thus narrowed from the entire fingerprint library to a single fingerprint group in the fingerprint library, which greatly reduces the amount of query-matching computation.
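  • Step 411 might be sketched as follows. The names are hypothetical, the block store is modelled as a dict, only identical fingerprints are treated as duplicates, and recording the fingerprint of each newly stored block in the fingerprint group is an assumption based on the statement that a fingerprint group holds the fingerprints of the stored files after deduplication.

```python
def deduplicate_in_group(blocks, fingerprints, fingerprint_group, storage):
    """Store only blocks whose fingerprint is missing from one fingerprint group.

    fingerprint_group -- set of fingerprints for the similar group
    storage           -- dict: fingerprint -> block data (hypothetical store)
    Returns the number of blocks that were skipped as duplicates.
    """
    skipped = 0
    for block, fp in zip(blocks, fingerprints):
        if fp in fingerprint_group:
            skipped += 1                 # duplicate: block data is not stored
        else:
            fingerprint_group.add(fp)    # new: record fingerprint, store data
            storage[fp] = block
    return skipped
```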
  • This embodiment provides a data deduplication method that divides a file to be stored into blocks, calculates a fingerprint of each block, samples the fingerprints using a sampling factor, determines, according to the generated fingerprint sampling table and the current group sampling library, the similar group to which the file to be stored belongs in the current group sampling library, and deduplicates the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library. Because the fingerprints of the blocks are further sampled, the similar group is first determined by similarity analysis, and deduplication is then performed only within the fingerprint group corresponding to the similar group, the amount of query computation for deduplication is reduced.
  • This addresses the huge computation and resource consumption that massive block data introduces into deduplication in the prior art, and improves deduplication performance.
  • FIG. 6 is a structural diagram of Embodiment 1 of the data deduplication apparatus of the present invention.
  • The embodiment provides a deduplication apparatus that may specifically perform the steps in Embodiment 1 of the foregoing method; the details are not repeated here.
  • The deduplication apparatus provided in this embodiment may specifically include a blocking module 601, a sampling module 602, a grouping module 603, and a deduplication module 604.
  • The blocking module 601 is configured to divide the file to be stored into blocks and calculate a fingerprint of each block in the blocking result.
  • The sampling module 602 is configured to sample the fingerprints of the blocks and generate a fingerprint sampling table of the file to be stored from the extracted fingerprints.
  • The grouping module 603 is configured to determine, according to the fingerprint sampling table and the group sampling library, the similar group to which the file to be stored belongs in the group sampling library.
  • The deduplication module 604 is configured to deduplicate the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in the fingerprint library.
  • The group sampling library is composed of at least one sampling group.
  • The fingerprint library is composed of at least one fingerprint group.
  • Each sampling group in the group sampling library corresponds to a fingerprint group in the fingerprint library.
  • The similar group is the sampling group in the group sampling library that matches the sampling fingerprints in the fingerprint sampling table of the file to be stored.
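  • To show how the four modules fit together, the sketch below chains them in one hypothetical class. It is a simplified illustration rather than the apparatus itself: blocks are fixed-size, SHA-1 alone serves as the fingerprint, sampling uses a bitwise AND with a fixed factor, grouping takes the first group that reaches the threshold, and no capacity limit is modelled.

```python
import hashlib

class DeduplicationDevice:
    """Illustrative pipeline: blocking -> sampling -> grouping -> deduplication."""

    def __init__(self, block_size=4, sampling_factor=0x1, threshold=0.5):
        self.block_size = block_size
        self.sampling_factor = sampling_factor
        self.threshold = threshold
        self.group_sampling_library = {}   # group ID -> set of sampling fingerprints
        self.fingerprint_library = {}      # group ID -> {fingerprint: block data}

    def blocking_module(self, data):
        # Fixed-size blocks keep the sketch short; SHA-1 is the fingerprint.
        blocks = [data[i:i + self.block_size]
                  for i in range(0, len(data), self.block_size)]
        return [(hashlib.sha1(b).hexdigest(), b) for b in blocks]

    def sampling_module(self, fingerprinted):
        return [fp for fp, _ in fingerprinted
                if (int(fp, 16) & self.sampling_factor) == 0]

    def grouping_module(self, table):
        if not table:
            return "preset"                # hypothetical preset group ID
        for gid, samples in self.group_sampling_library.items():
            shared = sum(fp in samples for fp in table)
            if shared / len(table) >= self.threshold:
                return gid
        gid = f"group-{len(self.group_sampling_library)}"
        self.group_sampling_library[gid] = set(table)
        return gid

    def deduplication_module(self, fingerprinted, gid):
        group = self.fingerprint_library.setdefault(gid, {})
        new_blocks = 0
        for fp, block in fingerprinted:
            if fp not in group:            # only unseen blocks are stored
                group[fp] = block
                new_blocks += 1
        return new_blocks

    def store(self, data):
        fingerprinted = self.blocking_module(data)
        gid = self.grouping_module(self.sampling_module(fingerprinted))
        return gid, self.deduplication_module(fingerprinted, gid)
```

  • Calling `store` returns the group chosen for the file and the number of blocks that were actually stored; a `sampling_factor` of 0 selects every fingerprint, which is convenient for experimenting with small inputs.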
  • FIG. 7 is a structural diagram of Embodiment 2 of the data deduplication apparatus of the present invention.
  • The embodiment provides a deduplication apparatus that may specifically perform the steps in Embodiment 2 of the foregoing method; the details are not repeated here.
  • The deduplication apparatus provided in this embodiment builds on the apparatus shown in FIG. 6, and the sampling module 602 may specifically include a determining unit 612, a sampling unit 622, and a generating unit 632.
  • The determining unit 612 is configured to determine a sampling factor according to the file features of the file to be stored, where the file features include the file size and the number of blocks of the file to be stored.
  • The sampling unit 622 is configured to sample the fingerprints of all blocks of the file to be stored using the sampling factor, according to the set sampling condition.
  • The generating unit 632 is configured to add the fingerprint of each block in the sampling result to the fingerprint sampling table of the file to be stored.
  • the grouping module 603 in this embodiment may specifically include a first grouping unit 613, a matching unit 623, and a second grouping unit 633.
  • the first grouping unit 613 is configured to determine, when the fingerprint sampling table is empty, that the similar group to which the file to be stored belongs in the grouping sample library is a preset group in the grouping sample database.
  • the matching unit 623 is configured to perform matching processing on each fingerprint in the fingerprint sampling table and the group sampling database when the fingerprint sampling table is not empty.
  • the second grouping unit 633 is configured to determine, according to the matching result, a similar group to which the file to be stored belongs in the grouping sample library.
  • the second grouping unit 633 in this embodiment may specifically include a first grouping subunit 6331, a second grouping subunit 6332, and a third grouping subunit 6333.
  • the first grouping subunit 6331 is configured to determine, if the capacity of the grouping sample library has reached its upper limit, that the similar group to which the file to be stored belongs in the grouping sample library is the group whose fingerprints have the highest similarity to the fingerprints in the fingerprint sampling table.
  • the second grouping subunit 6332 is configured to determine, if the capacity of the grouping sample library has not reached its upper limit and the fingerprint similarity between the fingerprints in the fingerprint sampling table and a sample group is greater than or equal to a preset similarity threshold, that the similar group to which the file to be stored belongs in the grouping sample library is that sample group.
  • the third grouping subunit 6333 is configured to, if the capacity of the grouping sample library has not reached its upper limit and the fingerprint similarity between the fingerprints in the fingerprint sampling table and every group in the grouping sample library is less than the preset similarity threshold, establish a new group in the grouping sample library, determine that the similar group to which the file to be stored belongs in the grouping sample library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
  • the grouping module 603 in this embodiment may be specifically configured to establish a new group in the grouping sample library, determine that the similar group to which the file to be stored belongs in the grouping sample library is the new group, and save the fingerprints in the fingerprint sampling table of the file to be stored to the new group.
  • the embodiment provides a data deduplication device that divides the file to be stored into blocks, calculates a fingerprint for each block, and samples the fingerprints; determines, according to the generated fingerprint sampling table and the grouping sample library, the similar group to which the file to be stored belongs in the grouping sample library; and performs duplicate data deletion on the file to be stored according to the fingerprint data in the fingerprint group of the fingerprint database that corresponds to the similar group. Because the fingerprints are further sampled, the similar group is first determined by similarity analysis, and duplicate data deletion is then performed only within the fingerprint group corresponding to that similar group. This reduces the amount of computation needed for deduplication queries, solves the prior-art problem that massive block data causes a large computation load and high resource consumption during deduplication, and improves deduplication performance.
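The grouping decision made by units 613, 623 and 633 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `GroupSampleLibrary`, the constant `PRESET_GROUP`, the reading of "capacity" as a maximum number of groups, and the similarity metric (the share of the file's sampled fingerprints that also occur in a group) are all hypothetical choices that the text above does not fix. Where several groups meet the threshold, the sketch picks the most similar one.

```python
from dataclasses import dataclass, field

PRESET_GROUP = "preset"  # hypothetical id for the preset group


@dataclass
class GroupSampleLibrary:
    """Maps a group id to the set of sampled fingerprints stored for that group."""
    max_groups: int      # assumed meaning of the "capacity upper limit"
    threshold: float     # preset similarity threshold
    groups: dict = field(default_factory=dict)

    def similarity(self, sample, group_id):
        # Assumed metric: fraction of the file's sampled fingerprints found in the group.
        return len(sample & self.groups[group_id]) / len(sample)

    def assign(self, sample_table):
        """Return the id of the similar group for a file's fingerprint sampling table."""
        sample = set(sample_table)
        if not sample:
            # Empty sampling table: fall back to the preset group.
            return PRESET_GROUP
        # Match the sampled fingerprints against every existing group.
        best_id, best_sim = None, -1.0
        for gid in self.groups:
            sim = self.similarity(sample, gid)
            if sim > best_sim:
                best_id, best_sim = gid, sim
        full = len(self.groups) >= self.max_groups
        # Library full: take the most similar group regardless of the threshold.
        # Library not full: take it only if it reaches the threshold.
        if best_id is not None and (full or best_sim >= self.threshold):
            return best_id
        # Otherwise open a new group and store the sampled fingerprints in it.
        new_id = f"group-{len(self.groups)}"
        self.groups[new_id] = set(sample)
        return new_id
```

With `max_groups=2` and `threshold=0.5`, a first file opens a new group, a file sharing half of its sampled fingerprints with that group joins it, and once two groups exist every later file is routed to its most similar group even when the similarity is below the threshold.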

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and device for deleting duplicate data. The method comprises: dividing a file to be stored into blocks and calculating the fingerprint of each block in the division result; sampling the fingerprint of each block and generating a fingerprint sampling table for the file to be stored according to the sampled fingerprints; determining a similar group of the file to be stored in a grouping sample library according to the fingerprint sampling table and the grouping sample library; and performing duplicate data deletion on the file to be stored according to the fingerprint data in the fingerprint group that corresponds to the similar group in a fingerprint database. The device comprises a blocking module, a sampling module, a grouping module, and a duplicate data deletion module. The invention solves the prior-art problem that a large amount of block data causes heavy computation and high resource consumption during deduplication, and it reduces the amount of deduplication computation when deleting duplicate data.
PCT/CN2012/085278 2011-11-25 2012-11-26 Procédé et dispositif de suppression de données en double WO2013075668A1 (fr)
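
The block, fingerprint, sample and deduplicate steps named in the abstract can be sketched roughly as below. The abstract does not specify a blocking scheme, a hash function, or a sampling rule, so fixed-size blocks, SHA-1 from Python's `hashlib`, and "keep fingerprints whose value is divisible by a modulus" are assumptions made purely for illustration, as are all function names.

```python
import hashlib


def split_blocks(data: bytes, block_size: int = 4096):
    """Divide the file content into fixed-size blocks (assumed blocking scheme)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block: bytes) -> str:
    """Fingerprint of one block; SHA-1 is an assumed choice of hash."""
    return hashlib.sha1(block).hexdigest()


def sample_fingerprints(fps, modulus: int = 4):
    """Build the fingerprint sampling table with an assumed rule:
    keep only fingerprints whose integer value is divisible by `modulus`."""
    return [fp for fp in fps if int(fp, 16) % modulus == 0]


def deduplicate(blocks, fps, group_fingerprints: set):
    """Return only the blocks whose fingerprint is not yet in the fingerprint
    group of the similar group, recording new fingerprints as they are seen."""
    new_blocks = []
    for block, fp in zip(blocks, fps):
        if fp not in group_fingerprints:
            group_fingerprints.add(fp)
            new_blocks.append(block)
    return new_blocks
```

The point of the design is that `deduplicate` searches only the fingerprint set of one similar group, selected from the much smaller sampling table, instead of querying every fingerprint ever stored.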

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110380773.3A CN103150260B (zh) 2011-11-25 2011-11-25 重复数据删除方法和装置
CN201110380773.3 2011-11-25

Publications (1)

Publication Number Publication Date
WO2013075668A1 (fr) 2013-05-30

Family

ID=48469137

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/085278 WO2013075668A1 (fr) 2011-11-25 2012-11-26 Procédé et dispositif de suppression de données en double

Country Status (2)

Country Link
CN (1) CN103150260B (fr)
WO (1) WO2013075668A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941605A (zh) * 2019-11-07 2020-03-31 北京浪潮数据技术有限公司 重复数据的在线删除方法、装置及可读存储介质

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103930890B (zh) * 2013-10-30 2015-09-23 华为技术有限公司 数据处理方法、装置及重删处理器
CN103631933B (zh) * 2013-12-06 2017-04-12 中国科学院计算技术研究所 一种面向分布式去重系统的数据路由方法
CN104205097B (zh) * 2013-12-31 2017-08-25 华为技术有限公司 一种去重方法装置与系统
CN103995863B (zh) * 2014-05-19 2018-06-19 华为技术有限公司 一种重复数据删除的方法及装置
WO2017214793A1 (fr) * 2016-06-13 2017-12-21 北京小米移动软件有限公司 Procédé et appareil de génération de modèle d'empreinte digitale
CN106409317B (zh) * 2016-09-29 2020-02-07 北京小米移动软件有限公司 梦话提取方法、装置及用于提取梦话的装置
CN107451204B (zh) * 2017-07-10 2021-01-05 创新先进技术有限公司 一种数据查询方法、装置及设备
CN108280628A (zh) * 2018-02-01 2018-07-13 泰康保险集团股份有限公司 基于区块链技术的核赔方法、装置、介质及电子设备
CN111488269B (zh) * 2019-01-29 2023-11-14 阿里巴巴集团控股有限公司 数据仓库的指标检测方法、装置和系统
CN116991329B (zh) * 2023-09-25 2023-12-08 深圳市明泰智能技术有限公司 一种自助服务终端设备的数据防冗余方法和系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008062145A1 (fr) * 2006-11-22 2008-05-29 Half Minute Media Limited Création d'empreintes digitales
CN101374234A (zh) * 2008-09-25 2009-02-25 清华大学 一种基于内容的视频拷贝监测方法及装置
CN102222085A (zh) * 2011-05-17 2011-10-19 华中科技大学 一种基于相似性与局部性结合的重复数据删除方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214210B (zh) * 2011-05-16 2013-03-13 华为数字技术(成都)有限公司 重复数据处理方法、装置和系统

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941605A (zh) * 2019-11-07 2020-03-31 北京浪潮数据技术有限公司 重复数据的在线删除方法、装置及可读存储介质
CN110941605B (zh) * 2019-11-07 2022-07-08 北京浪潮数据技术有限公司 重复数据的在线删除方法、装置及可读存储介质

Also Published As

Publication number Publication date
CN103150260A (zh) 2013-06-12
CN103150260B (zh) 2016-06-08

Similar Documents

Publication Publication Date Title
WO2013075668A1 (fr) Procédé et dispositif de suppression de données en double
EP2940598B1 (fr) Procédé et dispositif de traitement d'objet de données
US9223794B2 (en) Method and apparatus for content-aware and adaptive deduplication
CN102214210B (zh) 重复数据处理方法、装置和系统
WO2013086969A1 (fr) Procédé, dispositif et système permettant de trouver des données en double
WO2017096532A1 (fr) Procédé et appareil de stockage de données
WO2014094479A1 (fr) Procédé et dispositif permettant de supprimer des données dupliquées
Bo et al. Research on chunking algorithms of data de-duplication
CN106611035A (zh) 一种云存储中重复数据删除的检索算法
WO2012065408A1 (fr) Procédé et système de sauvegarde de données à tolérance élevée aux sinistres
US10509771B2 (en) System and method for data storage, transfer, synchronization, and security using recursive encoding
WO2014037767A1 (fr) Déduplication de données en ligne multi-niveau
WO2014067063A1 (fr) Procédé et dispositif de récupération de données en double
CN108415671B (zh) 一种面向绿色云计算的重复数据删除方法及系统
Xu et al. A lightweight virtual machine image deduplication backup approach in cloud environment
WO2014000458A1 (fr) Procédé et dispositif de traitement de petits fichiers
CN105515586B (zh) 一种快速差量压缩方法
US10162832B1 (en) Data aware deduplication
WO2021082926A1 (fr) Procédé et appareil de compression de données
Zhou et al. Hysteresis re-chunking based metadata harnessing deduplication of disk images
Kim et al. Design and implementation of binary file similarity evaluation system
CN112162973A (zh) 指纹碰撞规避、去重及恢复方法、存储介质和去重系统
JP6113816B1 (ja) 情報処理システム、情報処理装置、及びプログラム
CN104281412A (zh) 一种在数据存储前去除重复数据的方法
WO2018036290A1 (fr) Procédé de compression de données et terminal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application Ref document number: 12851328; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase Ref country code: DE
122 Ep: pct application non-entry in european phase Ref document number: 12851328; Country of ref document: EP; Kind code of ref document: A1