WO2014094479A1 - Data deduplication method and apparatus (重复数据删除方法和装置) - Google Patents

Data deduplication method and apparatus

Info

Publication number
WO2014094479A1
WO2014094479A1 · PCT/CN2013/084542 · CN2013084542W
Authority
WO
WIPO (PCT)
Prior art keywords
data
file
fingerprint
data block
hash table
Application number
PCT/CN2013/084542
Other languages
English (en)
French (fr)
Inventor
祁蕊
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2014094479A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments, based on file chunks

Definitions

  • Embodiments of the present invention relate to data processing technologies, and in particular, to a data deduplication method and apparatus.

Background
  • Deduplication technology, often referred to simply as a deduplication operation, is a mainstream storage technology. By detecting duplicate data in files, it eliminates redundant data, thereby improving the efficiency of the storage system, reducing storage space, and saving costs.
  • In a typical deduplication process, a file to be processed is divided into a plurality of smaller intermediate files. For each data block of each intermediate file, a data fingerprint is calculated and compared; the fingerprints of the unique, non-duplicate data blocks are stored in a hash table, and subsequent fingerprints are looked up in the hash table to determine the data block repetition rate, after which the duplicate data blocks are deleted.
  • The calculated data fingerprint is 128 bits. If the file to be processed is relatively large, there are many unique data blocks, and the hash table occupies a large amount of memory, affecting backup efficiency. If a large file is instead divided into small intermediate files that are deduplicated one by one, the hash table of an intermediate file is cleared once its deduplication completes, and a new hash table is generated when the next intermediate file is deduplicated. In this way, comparison of duplicate data blocks between intermediate files is lacking, which increases the repetition rate of the data blocks in the file to be processed and thereby lowers the space saving rate of the files.

Summary of the Invention
  • The embodiments of the invention provide a data deduplication method and apparatus to reduce the data block repetition rate of a file and improve the space utilization of file storage.
  • An embodiment of the present invention provides a data deduplication method, including:
  • dividing the file to be processed into at least two data blocks; calculating a data fingerprint of each data block in the file to be processed; and performing a deduplication operation on the data blocks of the file to be processed according to the data fingerprint of each data block and the data fingerprints in a hotspot hash table, where a data fingerprint in the hotspot hash table is a data fingerprint whose number of repeated occurrences in at least one file reaches a set threshold.
  • On the basis of the foregoing, the method further includes:
  • updating the hotspot hash table according to the number of times the data fingerprints repeatedly occur in the file to be processed.
  • Updating the hotspot hash table according to the number of times the data fingerprints repeatedly occur in the file to be processed includes: counting the number of occurrences of each data fingerprint, and writing the data fingerprints whose number of occurrences reaches the set threshold into the hotspot hash table.
  • The dividing of the file to be processed into at least two data blocks includes: dividing the file to be processed into at least two intermediate files, and dividing each intermediate file into at least two data blocks;
  • the method further includes: updating, according to the data fingerprint of each data block in each intermediate file, the hash table corresponding to that intermediate file; and, after the deduplication processing of an intermediate file is completed, clearing the hash table corresponding to that intermediate file.
  • Performing the deduplication operation on the data blocks of the file to be processed according to the data fingerprint of each data block and the data fingerprints in the hotspot hash table includes: matching the data fingerprint of each data block of the intermediate file against the data fingerprints in the hotspot hash table;
  • deduplicating the intermediate file includes: performing a byte comparison between the data block and the data block corresponding to the matching data fingerprint, and deleting the data block if they are consistent.
  • Before the data fingerprint of each data block in the file to be processed is calculated, the method further includes: compressing each data block.
  • An embodiment of the present invention further provides a data deduplication apparatus, including:
  • a data block dividing module, configured to divide the file to be processed into at least two data blocks; a calculating module, configured to calculate a data fingerprint of each data block in the file to be processed; and a first deduplication module, configured to perform a deduplication operation on the data blocks of the file to be processed according to the data fingerprint of each data block and the data fingerprints in the hotspot hash table, where a data fingerprint in the hotspot hash table is a data fingerprint whose number of repeated occurrences in at least one file reaches a set threshold.
  • the foregoing data deduplication apparatus further includes:
  • a hotspot hash table update module, configured to update the hotspot hash table according to the number of times the data fingerprints repeatedly occur in the file to be processed, after the data fingerprint of each data block in the file to be processed is calculated.
  • the hotspot hash table update module includes:
  • a statistical unit, configured to count the number of occurrences of each data fingerprint after the data fingerprint of each data block in the file to be processed is calculated;
  • a writing unit, configured to write the data fingerprints whose number of occurrences reaches the set threshold into the hotspot hash table.
  • The data block dividing module is specifically configured to divide the file to be processed into at least two intermediate files, and to divide each intermediate file into at least two data blocks;
  • the device also includes:
  • an update module, configured to: after the data fingerprint of each data block in the file to be processed is calculated, update the hash table corresponding to each intermediate file according to the data fingerprint of each data block in that intermediate file;
  • a clearing module, configured to clear the hash table corresponding to an intermediate file after the deduplication processing of that intermediate file is completed.
  • The first deduplication module includes: a matching unit, configured to match the data fingerprint of each data block of the intermediate file against the data fingerprints in the hotspot hash table;
  • a first deleting unit, configured to perform a byte comparison between the data block and the data block corresponding to the matching data fingerprint when the match is consistent, and to delete the data block if the comparison is consistent;
  • a trigger unit, configured to trigger, when the match is inconsistent, the second deduplication module to perform the deduplication operation on the intermediate file according to the hash table of the intermediate file.
  • The second deduplication module includes: a matching unit, configured to match the data fingerprint of each data block of the intermediate file against the data fingerprints in the hash table of the intermediate file; and a comparing and deleting unit, configured to perform a byte comparison between the data block and the data block corresponding to the matching data fingerprint when the match is consistent, and to delete the data block if the comparison is consistent.
  • the foregoing data deduplication device further includes:
  • a compression module configured to compress each data block before calculating a data fingerprint of each data block in the to-be-processed file.
  • The data deduplication method and apparatus of the embodiments of the present invention perform the deduplication operation by using a hotspot hash table, so that the deduplication of the file to be processed can take into account the data fingerprints with a higher number of repeated occurrences, in particular those repeated across multiple files. This reduces the repetition rate of file data blocks in the storage process and improves the space utilization of file storage.
  • FIG. 1 is a flowchart of Embodiment 1 of the data deduplication method of the present invention;
  • FIG. 2 is a flowchart of Embodiment 2 of the data deduplication method of the present invention;
  • FIG. 3 is a flowchart of Embodiment 3 of the data deduplication method of the present invention;
  • FIG. 4 is a schematic structural diagram of Embodiment 1 of the data deduplication apparatus of the present invention;
  • FIG. 5 is a schematic structural diagram of Embodiment 2 of the data deduplication apparatus of the present invention;
  • FIG. 6 is a schematic structural diagram of Embodiment 3 of the data deduplication apparatus of the present invention;
  • FIG. 7 is a schematic structural diagram of Embodiment 4 of the data deduplication apparatus of the present invention.

Detailed Description
  • FIG. 1 is a flowchart of Embodiment 1 of the data deduplication method of the present invention. As shown in FIG. 1, the method in this embodiment may include:
  • Step 101 Divide the file to be processed into at least two data blocks.
  • The file to be processed refers to all the files included in one storage action, and may be a single file, multiple files, a single volume, or multiple volumes of virtual data.
  • Commonly used division methods include fixed-length division, sliding-window division, and so on.
  • Fixed-length division is the simplest method and has relatively high performance; it is suitable for dividing stable files into data blocks.
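To make the fixed-length division above concrete, the following is a minimal illustrative sketch (not code from the patent): a byte string is cut into equal-sized data blocks, with a shorter tail block when the length is not an exact multiple. The function name and the block size are assumptions for illustration.

```python
def split_fixed_length(data: bytes, block_size: int = 64 * 1024) -> list:
    """Divide a byte string into fixed-length data blocks.

    The final block may be shorter than block_size when the input
    length is not an exact multiple of it.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Example: a 150 KB input yields two 64 KB blocks and one 22 KB tail block.
blocks = split_fixed_length(b"\x00" * (150 * 1024))
print([len(b) for b in blocks])  # [65536, 65536, 22528]
```

Sliding-window division would instead choose block boundaries from the content itself; the fixed-length variant shown here trades that flexibility for simplicity and speed, as the paragraph above notes.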
  • Step 102 Calculate a data fingerprint of each data block in the to-be-processed file.
  • The collision probability of data fingerprints calculated by the MD5 and SHA1 algorithms is relatively small. Therefore, in this embodiment, the MD5 algorithm or the SHA1 algorithm can be used to calculate the data fingerprint of each data block in the file to be processed.
  • Step 103 Perform a deduplication operation on the data block of the file to be processed according to the data fingerprint of each data block and the data fingerprint in the hotspot hash table.
  • A data fingerprint in the hotspot hash table is a data fingerprint whose number of repeated occurrences in at least one file reaches a set threshold.
  • The at least one file may refer to the current file to be processed, and may also refer to other files for which the hotspot hash table has been obtained, for example, a history file that has already been processed, or a combination of a history file and the current file to be processed.
  • The deduplication operation is performed by using the hotspot hash table, so that the deduplication of the file to be processed can take into account the data fingerprints with a higher number of repeated occurrences, in particular those repeated across multiple files, which reduces the repetition rate of file data blocks in the storage process and improves the space utilization of file storage.
  • The hotspot hash table is different from a general hash table: instead of storing the data fingerprint of every unique data block, it stores only the data fingerprints with a high number of repeated occurrences. Its data volume is therefore small, which reduces memory occupation.
  • FIG. 2 is a flowchart of Embodiment 2 of the data deduplication method of the present invention. As shown in FIG. 2, this embodiment is a deduplication method for the case where a hotspot hash table has already been stored in memory as a template. The method may include:
  • Step 201: Divide the file to be processed into at least two data blocks. In general, if the data block size specified by the system is too large, processing efficiency is affected; preferably, the system specifies each data block to be 64 KB.
  • Here, the file to be processed may be directly divided into at least two data blocks; alternatively, dividing the file to be processed into at least two data blocks includes: dividing the file to be processed into at least two intermediate files, and dividing each intermediate file into at least two data blocks.
  • This embodiment takes the latter case, in which the file to be processed is divided into intermediate files and each intermediate file is divided into at least two data blocks, as an example for detailed description. The deduplication method for a file directly divided into at least two data blocks is similar and is not described here.
  • Step 202 Compress each data block.
  • The at least two data blocks divided in step 201 can be compressed with a compression tool, and the following steps are all performed on the compressed data to further reduce the storage space.
  • Step 203 Read in the stored hotspot hash table.
  • Because this embodiment is a deduplication method for the case where the hotspot hash table has already been stored as a template in memory, the stored hotspot hash table is read in before the hash table is initialized.
  • Step 204 Initialize a hash table.
  • Initializing the hash table specifically means creating a new hash table, that is, defining a hash table for each intermediate file.
  • Here, a hash table of one of the at least two intermediate files is defined; specifically, the header information of the hash table, the file size to be stored, the data block size, the offset, and other such information are stored in the newly created hash table.
  • The header information of the hash table includes basic information of the file, such as the file size, file name, and file format; the offset indicates the specific location of the data block on the disk.
  • Step 205 Calculate a data fingerprint of each data block in the file to be processed.
  • the MD5 algorithm or the SHA1 algorithm is used to calculate the data fingerprint of each data block of the current intermediate file in the file to be processed.
  • Step 206 Update the hash table corresponding to the intermediate file according to the data fingerprint of each data block in the intermediate file.
  • Step 207 Update the hotspot hash table according to the repeated occurrence times of the data fingerprint in the file to be processed.
  • This step is optional. Updating the hotspot hash table according to the number of repeated occurrences of the data fingerprints in the file to be processed may be based only on the occurrence counts of the data fingerprints of the file to be processed, or may be based on the occurrence counts accumulated over both the history files and the file to be processed, with the counts of the same data fingerprint added together to update the hotspot hash table.
  • Updating the hotspot hash table according to the number of times the data fingerprints repeatedly occur in the file to be processed includes: counting the number of occurrences of each data fingerprint, and writing the data fingerprints whose number of occurrences reaches the set threshold into the hotspot hash table.
  • Here, the file to be processed is the current intermediate file, and the set threshold may be chosen according to experience.
  • For example, the data fingerprints may be sorted by the number of occurrences, and the data fingerprints with a high number of occurrences are then extracted and written into the hotspot hash table to update the original hotspot hash table stored as a template in memory.
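Step 207 can be sketched as follows. This is an illustrative assumption, not the patent's implementation: fingerprints are counted, and those whose count reaches the set threshold are written into the hotspot hash table, accumulating counts carried over from history files as the paragraph above describes. The function name, the dict representation, and the threshold value are all assumptions.

```python
from collections import Counter

def update_hotspot_table(hotspot: dict, fingerprints: list, threshold: int) -> dict:
    """Write fingerprints repeating at least `threshold` times into the
    hotspot hash table, accumulating counts across calls (files)."""
    counts = Counter(fingerprints)
    for fp, n in counts.items():
        if n >= threshold:
            # Add to any count carried over from previously processed files.
            hotspot[fp] = hotspot.get(fp, 0) + n
    return hotspot

hotspot = {}
update_hotspot_table(hotspot, ["a", "a", "a", "b", "c", "c"], threshold=2)
print(sorted(hotspot))  # ['a', 'c'] -- 'b' occurs only once, below threshold
```

Because only fingerprints above the threshold ever enter the table, the structure stays small relative to a full per-block hash table, which is the memory advantage the embodiment claims.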
  • Step 208 Perform a deduplication operation on the data block of the file to be processed according to the data fingerprint of each data block and the data fingerprint in the hotspot hash table.
  • the data fingerprint in the hotspot hash table is a data fingerprint whose number of repetitions reaches a set threshold in at least one file.
  • the at least one file may refer to the current file to be processed, and may also refer to other files that have obtained the hotspot hash table, for example, a history file that has been processed, or a combination of a history file and a current pending file. .
  • Performing the deduplication operation on the data blocks of the file to be processed according to the data fingerprint of each data block and the data fingerprints in the hotspot hash table includes: matching the data fingerprint of each data block against the data fingerprints in the hotspot hash table; when a match is found, performing a byte comparison between the data block and the data block corresponding to the matching data fingerprint; and deleting the data block if the comparison is consistent.
  • In the deduplication operation, the collision problem must be considered, that is, the scenario in which different data blocks generate the same data fingerprint. Therefore, a byte comparison is performed on the data block to finally confirm whether the data block contents are identical, that is, whether it is a duplicate data block. Specifically, when the data fingerprint of a data block of the intermediate file matches a data fingerprint in the hotspot hash table, a byte comparison is performed between the data block and the data block corresponding to that fingerprint in the hotspot hash table. If the comparison is consistent, the data block is deleted; if the comparison is inconsistent, an identifier is added to the data fingerprint of the data block to distinguish it from the data fingerprint in the hotspot hash table, and the data fingerprint with the identifier is written into the hash table.
  • Adding an identifier to the data fingerprint may be done by adding a field to the data fingerprint or by adopting another identifier.
  • The deduplication operation in this step refers to deleting the data blocks whose data fingerprints match a data fingerprint in the hotspot hash table and whose byte comparison is consistent.
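The collision check described above can be sketched like this. It is a minimal illustration under the assumption that the table maps each fingerprint to the stored block's bytes (a real system would more likely store a disk offset and read the block back); the function and variable names are invented for the example.

```python
def is_true_duplicate(block: bytes, fp: str, table: dict) -> bool:
    """Return True only if `block` matches the stored block for
    fingerprint `fp` byte for byte.

    A fingerprint match alone is not trusted: different blocks can, in
    principle, produce the same fingerprint (a collision), so only a
    byte-identical block may safely be deleted as a duplicate.
    """
    if fp not in table:
        return False
    return table[fp] == block  # byte comparison, not just fingerprint match

table = {"fp1": b"hello"}
print(is_true_duplicate(b"hello", "fp1", table))  # True: safe to delete
print(is_true_duplicate(b"world", "fp1", table))  # False: a collision
```

In the inconsistent (collision) case, the embodiment keeps the block and writes its fingerprint into the per-file hash table with an added identifier so it remains distinguishable from the colliding entry.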
  • Step 209 Perform a deduplication operation on the intermediate file according to the hash table of the intermediate file to generate a new file.
  • the new file generated is the backup file.
  • Deduplicating the intermediate file according to the hash table of the intermediate file includes: matching the data fingerprint of each data block against the data fingerprints in the hash table of the intermediate file; when a match is found, performing a byte comparison between the data block and the data block corresponding to the matching data fingerprint; and deleting the data block if the comparison is consistent.
  • In this deduplication operation, the collision problem, that is, the scenario in which different data blocks generate the same data fingerprint, must also be considered. Specifically, when the data fingerprint of a data block of the intermediate file matches a data fingerprint in the hash table of the intermediate file, a byte comparison is performed between the data block and the data block corresponding to that fingerprint in the hash table; if the comparison is consistent, the data block is deleted. If the comparison is inconsistent, an identifier is added to the data fingerprint of the data block to distinguish it from the data fingerprint in the hash table, and the data fingerprint with the identifier is written into the hash table.
  • adding an identifier to the data fingerprint may add a field to the data fingerprint or use another identifier.
  • When the data fingerprint of a data block of the intermediate file does not match any data fingerprint in the hash table of the intermediate file, the data fingerprint of the data block is written into the hash table of the intermediate file.
  • Steps 205 to 209 are repeated until the deduplication of every intermediate file is completed.
  • Step 210: After the deduplication processing of an intermediate file is completed, clear the hash table corresponding to that intermediate file.
  • In this embodiment, the hotspot hash table is used to add comparisons between data blocks and duplicate data across files, and the hotspot hash table and the per-file hash tables are combined to perform the deduplication operation. The deduplication of the file to be processed can thus take into account the data fingerprints with higher repetition counts, especially the data fingerprints repeated across multiple files, which reduces the repetition rate of file data blocks in the storage process and improves the space utilization of file storage.
  • In addition, the hotspot hash table differs from a general hash table in that it stores only the data fingerprints with a high repetition rate rather than the fingerprint of every unique data block, so its data volume is small and memory occupation is reduced.
  • The storage space is further reduced by compressing the data blocks; writing the data fingerprints whose repeated occurrences reach the set threshold into the hotspot hash table achieves the update of the hotspot hash table; and comparing the data blocks byte by byte achieves accurate determination of duplicate data blocks.
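The Embodiment 2 flow summarized above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: MD5 fingerprints, an in-memory hotspot table and per-file hash tables that map fingerprints directly to block bytes (so the byte comparison can be shown inline; a real system would compare against blocks read from disk), and a plain list of kept blocks standing in for the generated backup file. All names are invented.

```python
import hashlib

def dedup_with_hotspot(intermediate_files, hotspot):
    """Deduplicate each intermediate file against the hotspot hash table
    first, then against its own per-file hash table (steps 205-210)."""
    kept = []
    for blocks in intermediate_files:
        table = {}                               # step 204: per-file hash table
        for block in blocks:
            fp = hashlib.md5(block).hexdigest()  # step 205: data fingerprint
            # Step 208: hotspot match plus byte comparison.
            if fp in hotspot and hotspot[fp] == block:
                continue                         # duplicate across files: delete
            # Step 209: per-file match plus byte comparison.
            if fp in table and table[fp] == block:
                continue                         # duplicate within the file: delete
            table[fp] = block                    # unique block: record fingerprint
            kept.append(block)
        table.clear()                            # step 210: clear per-file table
    return kept

hotspot = {hashlib.md5(b"hot").hexdigest(): b"hot"}
files = [[b"hot", b"x", b"x"], [b"hot", b"y"]]
print(dedup_with_hotspot(files, hotspot))  # [b'x', b'y']
```

Note how `b"hot"` is removed from both intermediate files even though each per-file table is cleared between them: that cross-file deduplication is exactly what the hotspot hash table contributes over per-file tables alone.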
  • FIG. 3 is a flowchart of Embodiment 3 of the data deduplication method of the present invention. The difference between this embodiment and the embodiment shown in FIG. 2 is that the hotspot hash table is not stored in memory as a template in advance, but is generated during the deduplication process. The method in this embodiment may include:
  • Step 301: Divide the file to be processed into at least two data blocks. In general, if the data block size specified by the system is too large, processing efficiency is affected; preferably, the system specifies each data block to be 64 KB.
  • Here, the file to be processed may be directly divided into at least two data blocks; alternatively, dividing the file to be processed into at least two data blocks includes: dividing the file to be processed into at least two intermediate files, and dividing each intermediate file into at least two data blocks.
  • This embodiment takes the latter case, in which the file to be processed is divided into intermediate files and each intermediate file is divided into at least two data blocks, as an example for detailed description. The deduplication method for a file directly divided into at least two data blocks is similar and is not described here.
  • Step 302 Compress each data block.
  • The at least two data blocks divided in step 301 can be compressed with a compression tool, and the following steps are all performed on the compressed data to further reduce the storage space.
  • Step 303 Initialize a hotspot hash table.
  • Initializing the hotspot hash table specifically means creating a new hash table, that is, defining a hotspot hash table for the file to be processed.
  • Here, the hotspot hash table of the file to be processed is defined; that is, the header information of the hotspot hash table, the file size to be stored, the data block size, the offset, and other such information are stored in the newly created hotspot hash table.
  • The header information of the hotspot hash table includes basic information of the file, such as the file size, file name, and file format; the offset indicates the specific location of the data block on the disk.
  • Step 304 Initialize a hash table.
  • Initializing the hash table specifically means creating a new hash table, that is, defining a hash table for each intermediate file.
  • Here, a hash table of one of the at least two intermediate files is defined; specifically, the header information of the hash table, the file size to be stored, the data block size, the offset, and other such information are stored in the newly created hash table.
  • The header information of the hash table includes basic information of the file, such as the file size, file name, and file format; the offset indicates the specific location of the data block on the disk.
  • Step 305 Calculate a data fingerprint of each data block in the file to be processed.
  • the MD5 algorithm or the SHA1 algorithm is used to calculate the data fingerprint of each data block of the current intermediate file in the file to be processed.
  • Step 306 Update the hash table corresponding to the intermediate file according to the data fingerprint of each data block in the intermediate file.
  • Specifically, the data fingerprint of each data block in the intermediate file is compared with the data fingerprints already stored in the current hash table. If the data fingerprint of a data block in the intermediate file is inconsistent with every data fingerprint stored in the current hash table, the data fingerprint is stored in the hash table, so that the hash table ultimately holds the data fingerprints of the unique data blocks.
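Step 306 can be sketched as follows, again as an illustrative assumption rather than the patent's code: each block's fingerprint is checked against the current per-file hash table, and only fingerprints of unique blocks are stored. The table again maps fingerprints to block bytes for demonstration; the function name is invented.

```python
import hashlib

def update_file_table(table: dict, blocks: list) -> dict:
    """Store the fingerprint of each unique data block in the
    per-intermediate-file hash table."""
    for block in blocks:
        fp = hashlib.md5(block).hexdigest()
        if fp not in table:
            table[fp] = block  # first (unique) occurrence: record it
    return table

table = update_file_table({}, [b"a", b"b", b"a"])
print(len(table))  # 2: only the unique blocks' fingerprints are stored
```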
  • Step 307 Update the hotspot hash table according to the repeated occurrence times of the data fingerprint in the file to be processed.
  • The hotspot hash table may be updated by querying the hash table and writing into it each data fingerprint for which the number of duplicate data blocks pointed to reaches the set threshold; alternatively, after the data fingerprint of each data block in the file to be processed is calculated, the number of occurrences of each data fingerprint is counted, and the data fingerprints whose number of occurrences reaches the set threshold are written into the hotspot hash table.
  • Obtaining the hotspot hash table by querying the hash table specifically means: determining the threshold according to experience, and then, if the number of duplicate data blocks pointed to by a data fingerprint in the hash table is greater than the threshold, writing that data fingerprint into the hotspot hash table, so that the hotspot hash table stores the data fingerprints corresponding to the hot data blocks.
  • The hotspot hash table is stored in memory and can be applied in the deduplication of subsequent files.
  • Similarly, the hotspot hash table can also be obtained from the occurrence counts of the data fingerprints.
  • the hotspot hash table is stored as a template in memory.
  • Step 308 Perform a deduplication operation on the data block of the file to be processed according to the data fingerprint of each data block and the data fingerprint in the hotspot hash table.
  • the data fingerprint in the hotspot hash table is a data fingerprint whose number of repetitions reaches a set threshold in at least one file.
  • the at least one file refers to a file to be processed currently.
  • Performing the deduplication operation on the data blocks of the file to be processed according to the data fingerprint of each data block and the data fingerprints in the hotspot hash table includes: matching the data fingerprint of each data block against the data fingerprints in the hotspot hash table; when a match is found, performing a byte comparison between the data block and the data block corresponding to the matching data fingerprint; and deleting the data block if the comparison is consistent.
  • In the deduplication operation, the collision problem, that is, the scenario in which different data blocks generate the same data fingerprint, must be considered. Specifically, when the data fingerprint of a data block of the intermediate file matches a data fingerprint in the hotspot hash table, a byte comparison is performed between the data block and the data block corresponding to that fingerprint in the hotspot hash table; if the comparison is consistent, the data block is deleted; if the comparison is inconsistent, the data fingerprint of the data block is marked.
  • The deduplication operation in this step refers to deleting the data blocks whose data fingerprints match a data fingerprint in the hotspot hash table and whose byte comparison is consistent.
  • Step 309 Perform a deduplication operation on the intermediate file according to the hash table of the intermediate file to generate a new file.
  • the new file generated is the backup file.
  • Deduplicating the intermediate file according to the hash table of the intermediate file includes: matching the data fingerprint of each data block against the data fingerprints in the hash table of the intermediate file; when a match is found, performing a byte comparison between the data block and the data block corresponding to the matching data fingerprint; and deleting the data block if the comparison is consistent.
  • In this deduplication operation, the collision problem, that is, the scenario in which different data blocks generate the same data fingerprint, must also be considered. Specifically, when the data fingerprint of a data block of the intermediate file matches a data fingerprint in the hash table of the intermediate file, a byte comparison is performed between the data block and the data block corresponding to that fingerprint in the hash table; if the comparison is consistent, the data block is deleted. If the comparison is inconsistent, an identifier is added to the data fingerprint of the data block to distinguish it from the data fingerprint in the hash table, and the data fingerprint with the identifier is written into the hash table.
  • adding an identifier to the data fingerprint may add a field to the data fingerprint or use another identifier.
  • When the data fingerprint of a data block of the intermediate file does not match any data fingerprint in the hash table of the intermediate file, the data fingerprint of the data block is written into the hash table of the intermediate file.
  • Steps 305 to 309 are repeated until the deduplication of every intermediate file is completed.
  • Step 310: After the deduplication processing of an intermediate file is completed, clear the hash table corresponding to that intermediate file.
  • After the hotspot hash table has been generated and stored in memory as a template, subsequent files can be deduplicated using the process described in the second embodiment, which is not described here again.
  • In this embodiment, the hotspot hash table is generated at the same time as deduplication is performed, and is then used to add comparisons between data blocks and duplicate data across files; the hotspot hash table and the per-file hash tables are combined to perform the deduplication operation. The deduplication of the file to be processed can thus take into account the data fingerprints with a higher number of repeated occurrences, in particular those repeated across multiple files, which reduces the repetition rate of file data blocks in the storage process and improves the space utilization of file storage.
  • In addition, the hotspot hash table differs from a general hash table in that it stores only the data fingerprints with a high repetition rate rather than the fingerprint of every unique data block, so its data volume is small and memory occupation is reduced. The storage space is further reduced by compressing the data blocks. Writing into the hotspot hash table the data fingerprints for which the number of duplicate data blocks pointed to is greater than the threshold, or whose number of repeated occurrences reaches the set threshold, achieves the update of the hotspot hash table; and comparing data blocks with the same data fingerprint byte by byte achieves accurate determination of duplicate data blocks.
Correspondingly, when data backed up with the deduplication technique of the embodiments of the present invention is restored, the backed-up hotspot data can be extracted according to the characteristics of the file to be restored and loaded into memory and cache during recovery, improving the efficiency of data recovery.
FIG. 4 is a schematic structural diagram of Embodiment 1 of the data deduplication apparatus of the present invention. As shown in FIG. 4, the apparatus of this embodiment may include: a data block dividing module 11, a calculation module 12, and a first deduplication module 13. The data block dividing module 11 is configured to divide the file to be processed into at least two data blocks; the calculation module 12 is configured to calculate the data fingerprint of each data block in the file to be processed; and the first deduplication module 13 is configured to perform a deduplication operation on the data blocks of the file to be processed according to the data fingerprint of each data block and the data fingerprints in the hotspot hash table, where a data fingerprint in the hotspot hash table is one whose number of repeated occurrences in at least one file reaches a set threshold.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in FIG. 1; the implementation principle and technical effects are similar and are not described again here.
FIG. 5 is a schematic structural diagram of Embodiment 2 of the data deduplication apparatus of the present invention. As shown in FIG. 5, the apparatus of this embodiment, based on the structure shown in FIG. 4, may further include: a hotspot hash table update module 14, a compression module 15, an update module 16, a second deduplication module 17, and a clearing module 18. The data block dividing module 11 is specifically configured to divide the file to be processed into at least two intermediate files and to divide each intermediate file into at least two data blocks. The hotspot hash table update module 14 is configured to update the hotspot hash table, after the data fingerprint of each data block in the file to be processed has been calculated, according to the number of times data fingerprints recur in the file to be processed. The compression module 15 is configured to compress each data block before the data fingerprints of the data blocks in the file to be processed are calculated. The update module 16 is configured to update, after the data fingerprints have been calculated, the hash table corresponding to each intermediate file according to the data fingerprints of the data blocks in that intermediate file. The second deduplication module 17 is configured to perform a deduplication operation on the intermediate file according to the hash table of the intermediate file. The clearing module 18 is configured to clear the hash table corresponding to the intermediate file after deduplication of the intermediate file is complete.
The device in this embodiment may be used to implement the technical solutions of the method embodiments shown in FIG. 2 or FIG. 3; the implementation principle and technical effects are similar and are not described again here.
FIG. 6 is a schematic structural diagram of Embodiment 3 of the data deduplication apparatus of the present invention. As shown in FIG. 6, in the apparatus of this embodiment, based on the structure shown in FIG. 5, the hotspot hash table update module 14 may further include: a statistics unit 141 and a writing unit 142. The statistics unit 141 is configured to count the occurrences of each data fingerprint after the fingerprints of all data blocks in the file to be processed have been calculated, or after the fingerprint of each data block is calculated; the writing unit 142 is configured to write the data fingerprints whose occurrence count reaches the set threshold into the hotspot hash table.
The first deduplication module 13 may include: a first matching unit 131, a first deleting unit 132, and a triggering unit 133. The first matching unit 131 is configured to match, before the deduplication operation is performed on the intermediate file according to the hash table of the intermediate file, the data fingerprint of each data block of the intermediate file against the data fingerprints in the hotspot hash table. The first deleting unit 132 is configured to, when a match is found, perform a byte comparison between the data block and the data block corresponding to the matching fingerprint, and delete the data block if they are identical. The triggering unit 133 is configured to, when no match is found, trigger the deduplication operation performed by the second deduplication module on the intermediate file according to the hash table of the intermediate file.
The second deduplication module 17 may include: a second matching unit 171 and a comparison deleting unit 172. The second matching unit 171 is configured to match the data fingerprint of each data block of the intermediate file against the data fingerprints in the hash table of the intermediate file. The comparison deleting unit 172 is configured to, when a match is found, perform a byte comparison between the data block and the data block corresponding to the matching fingerprint, and delete the data block if they are identical.
The device of this embodiment may be used to execute the technical solutions of the method embodiments shown in FIG. 2 or FIG. 3; the implementation principle and technical effects are similar and are not described again here.
The deduplication method and apparatus provided by the embodiments of the present invention can be applied to backing up batches of files. By using a hotspot hash table, comparison of duplicate data within data blocks and across files is increased, improving the space saving rate of files. The embodiments of the present invention are also applicable to front-end and back-end deduplication, to local and remote data backup, and to virtualized environments, where full and incremental backups of batches of virtual machines can be performed. For example, in a desktop cloud system, the operating systems and application software of the managed virtual machines share many identical files, so applying the hotspot hash table enables fast and efficient full backup of virtual machines in bulk and greatly improves the space saving rate of the files.
FIG. 7 is a schematic structural diagram of the encryption device of Embodiment 4 of the data deduplication apparatus of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the network device. As shown in FIG. 7, the encryption device of this embodiment includes a processor 2101, a communication interface 2102, a memory 2103, and a bus 2104.
The processor 2101, the communication interface 2102, and the memory 2103 communicate with one another through the bus 2104; the communication interface 2102 is configured to communicate with other devices; and the processor 2101 is configured to execute program A. Specifically, program A may include program code, and the program code includes computer operating instructions.
The processor 2101 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 2103 is configured to store program A. The memory 2103 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory. Program A may specifically include: dividing the file to be processed into at least two data blocks; calculating the data fingerprint of each data block in the file to be processed; and performing a deduplication operation on the data blocks of the file to be processed according to the data fingerprint of each data block and the data fingerprints in the hotspot hash table, where a data fingerprint in the hotspot hash table is one whose number of repeated occurrences in at least one file reaches a set threshold.
Optionally, after the data fingerprint of each data block in the file to be processed is calculated, program A further includes: updating the hotspot hash table according to the number of times data fingerprints recur in the file to be processed.
Optionally, in program A, updating the hotspot hash table according to the number of times data fingerprints recur in the file to be processed includes: counting the occurrences of each data fingerprint after the fingerprints of all data blocks in the file have been calculated, or after the fingerprint of each data block is calculated; and writing the data fingerprints whose occurrence count reaches the set threshold into the hotspot hash table.
Optionally, in program A, dividing the file to be processed into at least two data blocks includes: dividing the file to be processed into at least two intermediate files, and dividing each intermediate file into at least two data blocks. After the data fingerprints are calculated, the method further includes: updating, according to the data fingerprints of the data blocks in each intermediate file, the hash table corresponding to that intermediate file; and, after deduplication of the intermediate file is complete, clearing the hash table corresponding to the intermediate file.
Optionally, in program A, performing the deduplication operation on the data blocks of the file to be processed according to the data fingerprint of each data block and the data fingerprints in the hotspot hash table includes: matching the data fingerprint of each data block of the intermediate file against the data fingerprints in the hotspot hash table; when a match is found, performing a byte comparison between the data block and the data block corresponding to the matching fingerprint, and deleting the data block if they are identical; and, when no match is found, triggering the deduplication operation performed on the intermediate file according to the hash table of the intermediate file.

Optionally, in program A, performing the deduplication operation on the intermediate file according to the hash table of the intermediate file includes: matching the data fingerprint of each data block of the intermediate file against the data fingerprints in the hash table of the intermediate file; when a match is found, performing a byte comparison between the data block and the data block corresponding to the matching fingerprint, and deleting the data block if they are identical.
Optionally, before the data fingerprint of each data block in the file to be processed is calculated, program A further includes: compressing each data block.
A person of ordinary skill in the art can understand that all or part of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.


Abstract

本发明实施例提供一种重复数据删除方法和装置,包括:将待处理文件划分成至少两个数据块;计算所述待处理文件中各数据块的数据指纹;根据各数据块的数据指纹和热点哈希表中的数据指纹对所述待处理文件的数据块进行去重操作,其中,所述热点哈希表中的数据指纹为在至少一个文件中重复出现次数达到设定门限值的数据指纹。本发明实施例的重复数据删除方法和装置,通过使用热点哈希表进行去重操作,降低了文件数据块的重复率,提高了文件存储的空间利用率。

Description

重复数据删除方法和装置 技术领域 本发明实施例涉及数据处理技术, 尤其涉及一种重复数据删除方法和 装置。 背景技术
重复数据删除技术, 简称去重操作, 是目前主流的一种存储技术, 通 过检索文件中重复的数据, 消除冗余数据, 从而提高存储系统的效率, 缩 减存储空间, 节约成本。
现有技术中, 通常将待处理文件划分为多个较小的中间文件。 针对每 个中间文件的各数据块, 计算其数据指纹并进行比对, 将不重复出现的唯 一数据块的哈希指纹存储在哈希表中, 进而通过检索哈希表中的数据指 纹, 获取数据块重复率, 将重复的数据块删除。
但由于现有技术中常用的哈希指纹算法, 例如 MD5算法, 计算获得的数据指纹是 128位, 若待处理文件比较大, 则唯一数据块较多, 哈希表就会占用大量内存, 影响备份效率。 若将大文件分为小的中间文件, 然后进行重复数据删除, 当一个中间文件完成去重操作后, 该中间文件的哈希表会清空, 当下一个中间文件进行重复数据删除时, 会生成新的哈希表。 这样又缺少了中间文件之间重复数据块的筛选, 增加了待处理文件中数据块的重复率, 从而影响文件的空间节约率。 发明内容
本发明实施例提供一种重复数据删除方法和装置, 以降低文件的数据 块重复率, 提高文件存储的空间利用率。
本发明实施例一方面提供一种重复数据删除方法, 包括:
将待处理文件划分成至少两个数据块;
计算所述待处理文件中各数据块的数据指纹; 根据各数据块的数据指纹和热点哈希表中的数据指纹对所述待处理 文件的数据块进行去重操作, 其中, 所述热点哈希表中的数据指纹为在至 少一个文件中重复出现次数达到设定门限值的数据指纹。
在第一方面的第一种可能的实施方式中, 在计算所述待处理文件中各 数据块的数据指纹之后, 还包括:
根据所述待处理文件中数据指纹重复出现的次数, 更新所述热点哈希 表。
结合第一方面的第一种可能的实施方式, 在第一方面的第二种可能的 实施方式中, 根据所述待处理文件中数据指纹重复出现的次数, 更新所述 热点哈希表包括:
在计算所述待处理文件中各数据块的数据指纹之后, 或在计算每个数 据块的数据指纹之后, 统计各数据指纹的出现次数;
将出现次数达到设定门限值的数据指纹写入热点哈希表中。
结合第一方面, 在第一方面的第三种可能的实施方式中, 将待处理文 件划分成至少两个数据块包括: 将待处理文件划分成至少两个中间文件, 将每个中间文件划分成至少两个数据块;
则计算所述待处理文件中各数据块的数据指纹之后, 还包括: 根据每个所述中间文件中各数据块的数据指纹, 更新所述中间文件对 应的哈希表; 在所述中间文件的去重处理完成后, 清空所述中间文件对应的哈希 表。
结合第一方面的第三种可能的实施方式, 在第一方面的第四种可能的实施方式中, 根据各数据块的数据指纹和热点哈希表中的数据指纹对所述待处理文件的数据块进行去重操作包括: 根据所述中间文件每个数据块的数据指纹, 在所述热点哈希表的数据指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进行字节比较, 若比较一致, 则删除所述数据块; 当匹配不一致时, 根据所述中间文件的哈希表对所述中间文件进行去重操作。
结合第一方面的第三种可能的实施方式或第一方面的第四种可能的 实施方式, 在第一方面的第五种可能实施方式中, 根据所述中间文件的哈 希表对所述中间文件进行去重操作包括:
根据所述中间文件每个数据块的数据指纹, 在所述中间文件的哈希表 的数据指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进 行字节比较, 若比较一致, 则删除所述数据块。
结合第一方面到第一方面的第四种实施方式, 在第一方面的第六种可 能实施方式中,计算所述待处理文件中各数据块的数据指纹之前,还包括: 对各数据块进行压缩。
本发明实施例另一方面提供一种重复数据删除装置, 包括:
数据块划分模块, 用于将待处理文件划分成至少两个数据块; 计算模块, 用于计算所述待处理文件中各数据块的数据指纹; 第一去重模块, 用于根据各数据块的数据指纹和热点哈希表中的数据 指纹对所述待处理文件的数据块进行去重操作, 其中, 所述热点哈希表中 的数据指纹为在至少一个文件中重复出现次数达到设定门限值的数据指 纹。
在第二方面的第一种可能的实施方式中, 上述重复数据删除装置还包 括:
热点哈希表更新模块, 用于在计算所述待处理文件中各数据块的数据 指纹之后, 根据所述待处理文件中数据指纹重复出现的次数, 更新所述热 点哈希表。
结合第二方面的第一种可能的实施方式, 在第二方面的第二种可能的 实施方式中, 所述热点哈希表更新模块包括:
统计单元, 用于在计算所述待处理文件中各数据块的数据指纹之后, 或在计算每个数据块的数据指纹之后, 统计各数据指纹的出现次数;
写入单元, 用于将出现次数达到设定门限值的数据指纹写入热点哈希 表中。 结合第二方面, 在第二方面的第三种可能的实施方式中, 数据块划分 分成至少两个数据块;
所述装置还包括:
更新模块, 用于在计算所述待处理文件中各数据块的数据指纹之后, 根据每个所述中间文件中各数据块的数据指纹, 更新所述中间文件对应的哈希表;
第二去重模块, 用于根据所述中间文件的哈希表对所述中间文件进行去重操作;
清空模块, 用于在所述中间文件的去重处理完成后, 清空所述中间文 件对应的哈希表。
结合第二方面的第三种可能的实施方式, 在第二方面的第四种可能的实施方式中, 所述第一去重模块包括: 第一匹配单元, 用于在根据所述中间文件的哈希表对所述中间文件进行去重操作之前, 根据所述中间文件每个数据块的数据指纹, 在所述热点哈希表的数据指纹中进行匹配;
第一删除单元, 用于当匹配一致时, 对所述数据块与匹配一致数据指 纹所对应的数据块进行字节比较, 若比较一致, 则删除所述数据块;
触发单元, 用于当匹配不一致时, 触发所述第二去重模块根据所述中 间文件的哈希表对所述中间文件进行的去重操作。
结合第二方面的第三种可能的实施方式或第二方面的第四种可能的 实施方式,在第二方面的第五种可能实施方式中,所述第二去重模块包括: 第二匹配单元, 用于根据所述中间文件每个数据块的数据指纹, 在所 述中间文件的哈希表的数据指纹中进行匹配;
比较删除单元, 用于当匹配一致时, 对所述数据块与匹配一致数据指 纹所对应的数据块进行字节比较, 若比较一致, 则删除所述数据块。
结合第二方面到第二方面的第四种实施方式, 在第二方面的第六种可 能实施方式中, 上述重复数据删除装置还包括:
压缩模块, 用于在计算所述待处理文件中各数据块的数据指纹之前, 对各数据块进行压缩。 本发明实施例的重复数据删除方法和装置, 通过使用热点哈希表进行 去重操作, 使得对待处理文件的去重操作可以考虑重复出现次数较高的数 据指纹, 特别是能够考虑在多个文件中重复出现的数据指纹, 能够在存储 过程中降低文件数据块的重复率, 提高文件存储的空间利用率。 附图说明 实施例或现有技术描述中所需要使用的附图作一简单地介绍, 显而易见 地, 下面描述中的附图是本发明的一些实施例, 对于本领域普通技术人员 来讲, 在不付出创造性劳动性的前提下, 还可以根据这些附图获得其他的 附图。
图 1为本发明重复数据删除方法实施例一的流程图;
图 2为本发明重复数据删除方法实施例二的流程图;
图 3为本发明重复数据删除方法实施例三的流程图;
图 4为本发明重复数据删除装置实施例一的结构示意图;
图 5为本发明重复数据删除装置实施例二的结构示意图;
图 6为本发明重复数据删除装置实施例三的结构示意图;
图 7为本发明重复数据删除装置实施例四的加密设备的结构示意图。 具体实施方式
为使本发明实施例的目的、 技术方案和优点更加清楚, 下面将结合本 发明实施例中的附图, 对本发明实施例中的技术方案进行清楚、 完整地描 述, 显然, 所描述的实施例是本发明一部分实施例, 而不是全部的实施例。 基于本发明中的实施例, 本领域普通技术人员在没有作出创造性劳动前提 下所获得的所有其他实施例, 都属于本发明保护的范围。
图 1为本发明重复数据删除方法实施例一的流程图, 如图 1所示, 本 实施例的方法可以包括:
步骤 101、 将待处理文件划分成至少两个数据块。
本步骤中, 待处理文件指的是一次存储动作下包括的所有文件, 可以 是单文件、 多文件、 单个卷及多虚拟数据等。 此外, 可根据待处理文件的 内容或者备份场景选择划分方法, 常用的划分方法如定长划分、 滑块划分 等。 一般的, 定长划分是最简捷的方法, 性能比较高, 适合用于将稳定的 文件划分成数据块。
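As an illustrative sketch of the fixed-length chunking (定长划分) named in this step — the function name is hypothetical, and the 64 KiB default mirrors the preferred block size mentioned later in the description:

```python
def split_fixed(data: bytes, chunk_size: int = 64 * 1024):
    """Divide file content into fixed-length data blocks; the last
    block may be shorter than chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
```

Sliding-window chunking (滑块划分) would instead choose block boundaries by content, at higher cost; fixed-length division is the simple, high-performance option the text recommends for stable files.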
步骤 102、 计算所述待处理文件中各数据块的数据指纹。
本步骤中, 对数据块进行计算获取数据指纹的方法有很多, 一般的,
MD5和 SHA1算法计算所得的数据指纹的碰撞几率比较小, 因此, 本实施 例中可以采用 MD5算法或 SHA1算法计算待处理文件中各数据块的数据 指纹。
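The fingerprint computation of this step can be sketched with Python's standard `hashlib`; MD5 and SHA-1 are the algorithms named in the text, while the helper name is an assumption:

```python
import hashlib

def fingerprint(block: bytes, algo: str = "md5") -> str:
    """Compute the data fingerprint of one data block; MD5 yields a
    128-bit digest and SHA-1 a 160-bit digest, both with the low
    collision probability noted in the description."""
    h = hashlib.new(algo)
    h.update(block)
    return h.hexdigest()
```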
步骤 103、 根据各数据块的数据指纹和热点哈希表中的数据指纹对所 述待处理文件的数据块进行去重操作。
本步骤中, 所述热点哈希表中的数据指纹为在至少一个文件中重复出 现次数达到设定门限值的数据指纹。 其中, 该至少一个文件可以指当前待 处理的文件, 也可以指获取到该热点哈希表的其它文件, 例如, 已经处理 过的历史文件, 或者是历史文件和当前待处理文件二者的结合。
本实施例的重复数据删除方法, 通过使用热点哈希表进行去重操作, 使得对待处理文件的去重操作可以考虑重复出现次数较高的数据指纹, 特 别是能够考虑在多个文件中重复出现的数据指纹, 能够在存储过程中降低 文件数据块的重复率, 提高文件存储的空间利用率。 此外, 热点哈希表区 别于一般的哈希表, 并非存储唯一数据块的数据指纹, 而是仅存储重复次 数高的数据指纹, 因此其数据量规模较小, 可减小对内存的占用。
下面采用几个具体的实施例, 对图 1所示方法实施例的技术方案进行 伴细说明。
图 2为本发明重复数据删除方法实施例二的流程图, 如图 2所示, 本 实施例是针对热点哈希表已作为模版存储在内存的情况下的重复数据删 除方法, 本实施例的方法可以包括:
步骤 201、 将待处理文件划分成至少两个数据块。 本步骤中, 将待处理文件划分成至少两个数据块。 一般的, 若系统规定的数据块过大, 则会影响处理效率, 优选的, 在系统中规定每个数据块大小为 64K。 对于待处理文件, 可以直接将该待处理文件划分成至少两个数据块; 或者, 将该待处理文件划分成至少两个数据块包括: 将待处理文件划分成至少两个中间文件, 将每个中间文件划分成至少两个数据块。 本实施例以将待处理文件划分成至少两个中间文件、 将每个中间文件划分成至少两个数据块的重复数据删除方法为例进行详细说明, 直接将该待处理文件划分成至少两个数据块的重复数据删除方法与之类同, 在此不再赘述。
步骤 202、 对各数据块进行压缩。
本步骤中, 可以借助压缩工具对步骤 201中划分的至少两个数据块进 行压缩, 下述步骤都在该压缩格式下进行, 以便进一步减少存储空间。
步骤 203、 读入已存储的热点哈希表。
本步骤中, 由于本实施例是针对热点哈希表已作为模版存储在内存的 情况下的重复数据删除方法, 因此, 在初始化哈希表之前先读入已存储的 热点哈希表。
步骤 204、 初始化哈希表。
本步骤中, 初始化哈希表具体为: 新建哈希表, 即, 定义针对每个中 间文件的哈希表。
本实施例中, 定义至少两个中间文件之中的一个中间文件的哈希表。 具体地, 将哈希表的头部信息、 待存储文件大小、 数据块大小、 偏移量等 信息存储在该新建的哈希表中。 其中, 哈希表的头部信息包括该文件的基 本信息, 如文件大小、 文件名、 文件格式等; 偏移量表示数据块在磁盘上 的具体位置信息。
步骤 205、 计算待处理文件中各数据块的数据指纹。
本步骤中, 采用 MD5算法或 SHA1算法计算待处理文件中当前中间 文件的各数据块的数据指纹。
步骤 206、 根据中间文件中各数据块的数据指纹, 更新中间文件对应 的哈希表。
本步骤中, 将中间文件中各数据块的数据指纹与当前哈希表中存储的 数据指纹进行比较, 若中间文件中数据块的数据指纹与当前哈希表中存储 的数据指纹不一致, 则将该数据指纹存储在该哈希表中, 以便最终将唯一 数据块的数据指纹保存在哈希表中。 步骤 207、 根据待处理文件中数据指纹的重复出现次数, 更新热点哈 希表。
本步骤为可选步骤, 其中, 根据待处理文件中数据指纹的重复出现次 数, 更新热点哈希表, 具体可以是仅基于待处理文件的数据指纹出现次数 进行更新, 也可以是基于历史文件和待处理文件中统计的数据指纹出现次 数, 对相同数据指纹的出现次数进行累计, 以更新热点哈希表。
本实施例中, 根据所述待处理文件中数据指纹重复出现的次数, 更新 所述热点哈希表包括:
在计算所述待处理文件中各数据块的数据指纹之后, 或在计算每个数 据块的数据指纹之后, 统计各数据指纹的出现次数;
将出现次数达到设定门限值的数据指纹写入热点哈希表中。
具体地, 上述待处理文件即为当前中间文件, 设定门限值可根据经验 设定。 或者, 也可以将各数据指纹依照出现的次数排序, 然后提取出现次 数高的数据指纹并写入热点哈希表中, 以便更新作为模版存储在内存中的 原热点哈希表。
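The occurrence-count update described here can be sketched as follows; modelling the hotspot hash table as a set of fingerprints, and the particular threshold value, are simplifying assumptions:

```python
from collections import Counter

def update_hotspot(fingerprints, threshold, hotspot=None):
    """Count how often each data fingerprint appears and write every
    fingerprint whose occurrence count reaches the set threshold into
    the hotspot hash table (modelled here as a set of fingerprints)."""
    hotspot = set() if hotspot is None else hotspot
    for fp, count in Counter(fingerprints).items():
        if count >= threshold:
            hotspot.add(fp)
    return hotspot
```

Passing an existing table as `hotspot` corresponds to updating the template already stored in memory.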
步骤 208、 根据各数据块的数据指纹和热点哈希表中的数据指纹对待 处理文件的数据块进行去重操作。
本步骤中, 所述热点哈希表中的数据指纹为在至少一个文件中重复出 现次数达到设定门限值的数据指纹。 其中, 该至少一个文件可以指当前待 处理的文件, 也可以指获取到该热点哈希表的其它文件, 例如, 已经处理 过的历史文件, 或者是历史文件和当前待处理文件二者的结合。
具体地, 根据各数据块的数据指纹和热点哈希表中的数据指纹对待处 理文件的数据块进行去重操作包括:
根据所述中间文件每个数据块的数据指纹, 在所述热点哈希表的数据 指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进行字节比较, 若比较一致, 则删除所述数据块; 当匹配不一致时, 触发根据所述中间文件的哈希表对所述中间文件进行的去重操作。
一般的, 当数据块生成的数据指纹相同时, 要考虑碰撞问题, 即不同 数据块生成相同数据指纹的场景, 因此, 通过对数据块进行字节比较, 最 终确认数据块内容是否完全相同, 即是否是重复数据块。 具体地, 当中间 文件的一个数据块的数据指纹与热点哈希表的数据指纹匹配一致时, 对该 数据块和热点哈希表中该数据指纹的对应的数据块进行字节比较, 若比较 一致, 则删除该数据块, 若比较不一致, 则对该数据块的数据指纹添加标 识, 使其区别于热点哈希表中的数据指纹, 并将带有标识的数据指纹写入 哈希表中。 其中, 对数据指纹添加标识可以对数据指纹增加一个字段或采 用其他标识。 本步骤中的去重操作指的是将数据指纹和热点哈希表的数据 指纹匹配一致并且字节比较一致的数据块删除。
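The match-then-verify logic of this paragraph can be sketched as one deduplication step. The store layout (fingerprint → stored block) and the function name are assumptions, and the collision-marking detail is simplified to keeping the first block:

```python
import hashlib

def dedup_step(block: bytes, store: dict) -> bool:
    """One deduplication step: a fingerprint match alone is not trusted,
    because different blocks can collide on the same fingerprint; the
    block is deleted only after a byte-for-byte comparison confirms it.
    Returns True if the block was deleted as a duplicate."""
    fp = hashlib.md5(block).hexdigest()
    if fp in store and store[fp] == block:  # byte comparison on match
        return True                         # confirmed duplicate: delete
    # no match, or a collision (same fingerprint, different bytes):
    # keep the block; the description adds a marker in the collision case
    store.setdefault(fp, block)
    return False
```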
步骤 209、 根据中间文件的哈希表对中间文件进行去重操作, 生成新 的文件。
本步骤中, 生成的新的文件即为备份文件。 根据中间文件的哈希表对 所述中间文件进行去重操作包括:
根据所述中间文件每个数据块的数据指纹, 在所述中间文件的哈希表 的数据指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进 行字节比较, 若比较一致, 则删除所述数据块。
一般的, 当数据块生成的数据指纹相同时, 要考虑碰撞问题, 即不同 数据块生成相同数据指纹的场景, 因此, 通过对数据块进行字节比较, 最 终确认数据块内容是否完全相同, 即是否是重复数据块。 具体地, 当中间 文件的一个数据块的数据指纹与中间文件的哈希表的数据指纹匹配一致 时, 对该数据块和哈希表中该数据指纹的对应的数据块进行字节比较, 若 比较一致, 则删除该数据块, 若比较不一致, 则对该数据块的数据指纹添 加标识, 使其区别于哈希表中的数据指纹, 并将带有标识的数据指纹写入 哈希表中。 其中, 对数据指纹添加标识可以对数据指纹增加一个字段或采 用其他标识。
此外, 当中间文件的一个数据块的数据指纹与中间文件的哈希表的数 据指纹匹配不一致时, 将该数据块的数据指纹写入中间文件的哈希表中。
一般的, 步骤 205至步骤 209循环进行, 直至中间文件去重完成。 步骤 210、 在中间文件的去重处理完成后, 清空中间文件对应的哈希 表。
本步骤中, 在当前中间文件去重处理完成后, 清空当前中间文件对应 的哈希表, 然后从步骤 204开始对两个中间文件中的另一个中间文件进行 相同流程的去重处理。
本实施例的重复数据删除方法, 通过采用热点哈希表增加了数据块内 部及文件之间重复数据的对比, 将热点哈希表和哈希表相结合进行去重操 作, 使得对待处理文件的去重操作可以考虑重复出现次数较高的数据指 纹, 特别是能够考虑在多个文件中重复出现的数据指纹, 能够在存储过程 中降低文件数据块的重复率, 提高文件存储的空间利用率; 此外, 热点哈 希表区别于一般的哈希表, 并非存储唯一数据块的数据指纹, 而是仅存储 重复次数高的数据指纹, 因此其数据量规模较小, 可减小对内存的占用; 通过对数据块的压缩处理进一步减少了存储空间; 通过将重复出现次数达 到设定门限值的数据指纹写入热点哈希表中, 达到更新热点哈希表的目 的; 通过对具有相同数据指纹的数据块进行字节比较, 实现准确判定重复 数据块的目的。
图 3为本发明重复数据删除方法实施例三的流程图, 如图 3所示, 本 实施例与图 2所示实施例的区别在于热点哈希表未作为模版存储在内存 中, 而是需要在处理过程中同时生成热点哈希表, 本实施例的方法可以包 括:
步骤 301、 将待处理文件划分成至少两个数据块。 本步骤中, 将待处理文件划分成至少两个数据块。 一般的, 若系统规定的数据块过大, 则会影响处理效率, 优选的, 在系统中规定每个数据块大小为 64K。 对于待处理文件, 可以直接将该待处理文件划分成至少两个数据块; 或者, 将该待处理文件划分成至少两个数据块包括: 将待处理文件划分成至少两个中间文件, 将每个中间文件划分成至少两个数据块。 本实施例以将待处理文件划分成至少两个中间文件、 将每个中间文件划分成至少两个数据块的重复数据删除方法为例进行详细说明, 直接将该待处理文件划分成至少两个数据块的重复数据删除方法与之类同, 在此不再赘述。
本步骤中, 可以借助压缩工具对步骤 301中划分的至少两个数据块进 行压缩, 下述步骤都在该压缩格式下进行, 以便进一步减少存储空间。
步骤 303、 初始化热点哈希表。
本步骤中, 由于本实施例是针对热点哈希表未作为模版存储在内存 中, 而是需要在处理过程中同时生成热点哈希表的情况, 因此, 初始化热 点哈希表具体为: 新建热点哈希表, 即定义待处理文件的热点哈希表。
具体地, 定义待处理文件的热点哈希表, 即, 将热点哈希表的头部信 息、 待存储文件大小、 数据块大小、 偏移量等信息存储在该新建的热点哈 希表中。 其中, 热点哈希表的头部信息包括该文件的基本信息, 如文件大 小、 文件名、 文件格式等; 偏移量表示数据块在磁盘上的具体位置信息。
步骤 304、 初始化哈希表。
本步骤中, 初始化哈希表具体为: 新建哈希表, 即, 定义针对每个中 间文件的哈希表。
本实施例中, 定义至少两个中间文件之中的一个中间文件的哈希表。 具体地, 将哈希表的头部信息、 待存储文件大小、 数据块大小、 偏移量等 信息存储在该新建的哈希表中。 其中, 哈希表的头部信息包括该文件的基 本信息, 如文件大小、 文件名、 文件格式等; 偏移量表示数据块在磁盘上 的具体位置信息。
步骤 305、 计算待处理文件中各数据块的数据指纹。
本步骤中, 采用 MD5算法或 SHA1算法计算待处理文件中当前中间 文件的各数据块的数据指纹。
步骤 306、 根据中间文件中各数据块的数据指纹, 更新中间文件对应 的哈希表。
本步骤中, 将中间文件中各数据块的数据指纹与当前哈希表中存储的 数据指纹进行比较, 若中间文件中数据块的数据指纹与当前哈希表中存储 的数据指纹不一致, 则将该数据指纹存储在该哈希表中, 以便最终将唯一 数据块的数据指纹保存在哈希表中。
步骤 307、 根据待处理文件中数据指纹的重复出现次数, 更新热点哈 希表。 本步骤中, 可有通过查询哈希表, 将数据指纹所指向的重复数据块的 个数大于阀值的数据指纹写入热点哈希表中; 或者在计算待处理文件中各 数据块的数据指纹之后, 或在计算每个数据块的数据指纹之后, 统计各数 据指纹的出现次数; 将出现次数达到设定门限值的数据指纹写入热点哈希 表中。
具体地,通过查询哈希表获得热点哈希表具体为:根据经验确定阀值, 然后, 若哈希表中的某个数据指纹指向重复数据块的个数大于该阀值, 则 系统将该数据指纹写入热点哈希表中, 则热点哈希表就存入了热点数据块 所对应的数据指纹, 该热点哈希表存储在内存中, 可应用在后续文件的重 复数据删除操作中。 此外, 也可通过数据指纹出现次数获得热点哈希表。 并将该热点哈希表作为模版存储在内存中。
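The hash-table query described here — writing into the hotspot hash table every fingerprint whose number of pointed-to duplicate blocks exceeds the threshold — can be sketched as follows; representing the per-file hash table as a mapping from fingerprint to duplicate-block count is a hypothetical layout, since the patent does not fix the table's structure:

```python
def promote_hotspots(hash_table: dict, threshold: int) -> set:
    """Query the per-file hash table and return the fingerprints whose
    number of pointed-to duplicate blocks exceeds the threshold; these
    are the entries to be written into the hotspot hash table."""
    return {fp for fp, dup_count in hash_table.items() if dup_count > threshold}
```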
步骤 308、 根据各数据块的数据指纹和热点哈希表中的数据指纹对待 处理文件的数据块进行去重操作。
本步骤中, 所述热点哈希表中的数据指纹为在至少一个文件中重复出 现次数达到设定门限值的数据指纹。 其中, 该至少一个文件指当前待处理 的文件。
具体地, 根据各数据块的数据指纹和热点哈希表中的数据指纹对待处 理文件的数据块进行去重操作包括:
根据所述中间文件每个数据块的数据指纹, 在所述热点哈希表的数据 指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进行字节比较, 若比较一致, 则删除所述数据块; 当匹配不一致时, 触发根据所述中间文件的哈希表对所述中间文件进行的去重操作。
一般的, 当数据块生成的数据指纹相同时, 要考虑碰撞问题, 即不同 数据块生成相同数据指纹的场景, 因此, 通过对数据块进行字节比较, 最 终确认数据块内容是否完全相同, 即是否是重复数据块。 具体地, 当中间 文件的一个数据块的数据指纹与热点哈希表的数据指纹匹配一致时, 对该 数据块和热点哈希表中该数据指纹的对应的数据块进行字节比较, 若比较 一致, 则删除该数据块, 若比较不一致, 则对该数据块的数据指纹添加标 识, 使其区别于热点哈希表中的数据指纹, 并将带有标识的数据指纹写入 哈希表中。 其中, 对数据指纹添加标识可以对数据指纹增加一个字段或采 用其他标识。 本步骤中的去重操作指的是将数据指纹和热点哈希表的数据 指纹匹配一致并且字节比较一致的数据块删除。
步骤 309、 根据中间文件的哈希表对中间文件进行去重操作, 生成新 的文件。
本步骤中, 生成的新的文件即为备份文件。 根据中间文件的哈希表对 所述中间文件进行去重操作包括:
根据所述中间文件每个数据块的数据指纹, 在所述中间文件的哈希表 的数据指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进 行字节比较, 若比较一致, 则删除所述数据块。
一般的, 当数据块生成的数据指纹相同时, 要考虑碰撞问题, 即不同 数据块生成相同数据指纹的场景, 因此, 通过对数据块进行字节比较, 最 终确认数据块内容是否完全相同, 即是否是重复数据块。 具体地, 当中间 文件的一个数据块的数据指纹与中间文件的哈希表的数据指纹匹配一致 时, 对该数据块和哈希表中该数据指纹的对应的数据块进行字节比较, 若 比较一致, 则删除该数据块, 若比较不一致, 则对该数据块的数据指纹添 加标识, 使其区别于哈希表中的数据指纹, 并将带有标识的数据指纹写入 哈希表中。 其中, 对数据指纹添加标识可以对数据指纹增加一个字段或采 用其他标识。
此外, 当中间文件的一个数据块的数据指纹与中间文件的哈希表的数 据指纹匹配不一致时, 将该数据块的数据指纹写入中间文件的哈希表中。
一般的, 步骤 305至步骤 309循环进行, 直至中间文件去重完成。
步骤 310、 在中间文件的去重处理完成后, 清空中间文件对应的哈希表。
本步骤中, 在当前中间文件去重处理完成后, 清空当前中间文件对应 的哈希表,此时,对于两个中间文件中的另一个中间文件进行去重处理时, 相当于本发明重复数据删除方法实施例二中针对热点哈希表已作为模版 存储在内存的情况下的重复数据删除方法, 因此, 可采用实施例二中所述 流程进行去重处理, 在此不再贅述。
本实施例的重复数据删除方法, 在进行重复数据删除的同时生成热 点哈希表, 然后通过采用热点哈希表增加了数据块内部及文件之间重复数 据的对比, 将热点哈希表和哈希表相结合进行去重操作, 使得对待处理文 件的去重操作可以考虑重复出现次数较高的数据指纹, 特别是能够考虑在 多个文件中重复出现的数据指纹, 能够在存储过程中降低文件数据块的重 复率, 提高文件存储的空间利用率; 此外, 热点哈希表区别于一般的哈希 表,并非存储唯一数据块的数据指纹,而是仅存储重复次数高的数据指纹, 因此其数据量规模较小, 可减小对内存的占用; 通过对数据块的压缩处理 进一步减少了存储空间; 通过将数据指纹所指向的重复数据块的个数大于 阀值的数据指纹或重复出现次数达到设定门限值的数据指纹写入热点哈 希表中, 达到更新热点哈希表的目的; 通过对具有相同数据指纹的数据块 进行字节比较, 实现准确判定重复数据块的目的。
相应的, 利用本发明实施例的重复数据删除技术备份的数据, 在恢复 数据时, 可以根据恢复文件的特性, 提取备份的热点数据, 恢复时将热点 数据存入内存及緩存中, 提高恢复数据的效率。
图 4为本发明重复数据删除装置实施例一的结构示意图,如图 4所示, 本实施例的装置可以包括: 数据块划分模块 11、 计算模块 12和第一去重 模块 13。 其中, 数据块划分模块 1 1 , 用于将待处理文件划分成至少两个 数据块; 计算模块 12用于计算所述待处理文件中各数据块的数据指纹; 第一去重模块 13 ,用于根据各数据块的数据指纹和热点哈希表中的数据指 纹对所述待处理文件的数据块进行去重操作, 其中, 所述热点哈希表中的 数据指纹为在至少一个文件中重复出现次数达到设定门限值的数据指纹。
本实施例的装置, 可以用于执行图 1所示方法实施例的技术方案, 其 实现原理和技术效果类似, 此处不再贅述。
图 5为本发明重复数据删除装置实施例二的结构示意图,如图 5所示, 本实施例的装置在图 4所示装置结构的基础上, 进一步地, 还可以包括: 热点哈希表更新模块 14、 压缩模块 15、 更新模块 16、 第二去重模块 17 和清空模块 18。 其中, 数据块划分模块 11具体用于将待处理文件划分成 至少两个中间文件, 将每个中间文件划分成至少两个数据块; 热点哈希表 更新模块 14用于在计算所述待处理文件中各数据块的数据指纹之后, 根 据所述待处理文件中数据指纹重复出现的次数, 更新所述热点哈希表; 压 缩模块 15用于在计算所述待处理文件中各数据块的数据指纹之前, 对各 数据块进行压缩; 更新模块 16用于在计算所述待处理文件中各数据块的 数据指纹之后, 根据每个所述中间文件中各数据块的数据指纹, 更新所述 中间文件对应的哈希表; 第二去重模块 17用于根据所述中间文件的哈希 表对所述中间文件进行去重操作; 清空模块 18用于在所述中间文件的去 重处理完成后, 清空所述中间文件对应的哈希表。
本实施例的装置, 可以用于执行图 2或图 3所示方法实施例的技术方 案, 其实现原理和技术效果类似, 此处不再贅述。
图 6为本发明重复数据删除装置实施例三的结构示意图,如图 6所示, 本实施例的装置在图 5所示装置结构的基础上, 进一步地, 热点哈希表更 新模块 14可以包括: 统计单元 141和写入单元 142。 其中, 统计单元 141 用于在计算所述待处理文件中各数据块的数据指纹之后, 或在计算每个数 据块的数据指纹之后, 统计各数据指纹的出现次数; 写入单元 142 , 用于 将出现次数达到设定门限值的数据指纹写入热点哈希表中。
第一去重模块 13可以包括: 第一匹配单元 131、 第一删除单元 132 和触发单元 133。 其中, 第一匹配单元 131 , 用于在根据所述中间文件的 哈希表对所述中间文件进行去重操作之前, 根据所述中间文件每个数据块 的数据指纹, 在所述热点哈希表的数据指纹中进行匹配; 第一删除单元 132 , 用于当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数 据块进行字节比较, 若比较一致, 则删除所述数据块; 触发单元 133 , 用 于当匹配不一致时, 触发所述第二去重模块根据所述中间文件的哈希表对 所述中间文件进行的去重操作。
第二去重模块 17可以包括: 第二匹配单元 171和比较删除单元 172。 其中, 第二匹配单元 171 , 用于根据所述中间文件每个数据块的数据指纹, 在所述中间文件的哈希表的数据指纹中进行匹配; 比较删除单元 172 , 用 于当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进行 字节比较, 若比较一致, 则删除所述数据块。
本实施例的装置, 可以用于执行图 2或图 3所示方法实施例的技术方 案, 其实现原理和技术效果类似, 此处不再贅述。
本发明实施例提供的重复数据删除方法和装置, 可以应用于备份批量 文件中, 通过采用热点哈希表增加了数据块内部及文件之间重复数据的对 比, 提高了文件的空间节省率。 同时本发明实施例也适用于前端重复数据 删除及后端重复数据删除, 本地数据备份及远程数据备份, 以及虚拟化环 境中。 在虚拟化环境中, 对批量虚拟机进行全量和增量备份。 例如, 对于 桌面云系统, 由于它所管理的虚拟机的操作系统及应用软件有很多相同文 件, 应用热点哈希表能快速并有效的对批量虚拟机进行全量备份, 并且会 大大提高文件的空间节省率。
图 7 为本发明重复数据删除装置实施例四的加密设备的装置结构示意 图。 本发明具体实施例并不对所述网络设备的具体实现做限定。 如图 7所示, 本实施例的加密设备包括处理器 (processor)2101、 通信接口(Communications Interface)2102、 存储器 (memory)2103以及总线 2104。
其中, 处理器 2101、 通信接口 2102、 存储器 2103通过总线 2104完成相 互间的通信; 通信接口 2102, 用于与其他设备进行通信; 处理器 2101 , 用于 执行程序 A。
具体地,程序 A可以包括程序代码,所述程序代码包括计算机操作指令。 处理器 2101 可能是一个中央处理器 CPU, 或者是特定集成电路 ASIC ( Application Specific Integrated Circuit ) , 或者是被配置成实施本发明实施例 的一个或多个集成电路。
存储器 2103, 用于存放程序 A。 存储器 2103可能包含高速 RAM存储器, 也可能还包括非易失性存储器 (non-volatile memory), 例如至少一个磁盘存储器。 程序 A具体可以包括:
将待处理文件划分成至少两个数据块;
计算所述待处理文件中各数据块的数据指纹;
根据各数据块的数据指纹和热点哈希表中的数据指纹对所述待处理 文件的数据块进行去重操作, 其中, 所述热点哈希表中的数据指纹为在至 少一个文件中重复出现次数达到设定门限值的数据指纹。
上述程序 A, 优选的是在计算所述待处理文件中各数据块的数据指纹 之后, 还包括: 根据所述待处理文件中数据指纹重复出现的次数, 更新所 述热点哈希表。
上述程序 A, 优选的是根据所述待处理文件中数据指纹重复出现的次 数, 更新所述热点哈希表包括:
在计算所述待处理文件中各数据块的数据指纹之后, 或在计算每个数 据块的数据指纹之后, 统计各数据指纹的出现次数;
将出现次数达到设定门限值的数据指纹写入热点哈希表中。
上述程序 A, 优选的是将待处理文件划分成至少两个数据块包括: 将 待处理文件划分成至少两个中间文件, 将每个中间文件划分成至少两个数 据块;
则计算所述待处理文件中各数据块的数据指纹之后, 还包括: 根据每个所述中间文件中各数据块的数据指纹, 更新所述中间文件对 应的哈希表; 在所述中间文件的去重处理完成后, 清空所述中间文件对应的哈希 表。
上述程序 A, 优选的是根据各数据块的数据指纹和热点哈希表中的数据指纹对所述待处理文件的数据块进行去重操作包括: 根据所述中间文件每个数据块的数据指纹, 在所述热点哈希表的数据指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进行字节比较, 若比较一致, 则删除所述数据块;
当匹配不一致时, 触发根据所述中间文件的哈希表对所述中间文件进行的去重操作。
上述程序 A, 优选的是根据所述中间文件的哈希表对所述中间文件进行去重操作包括:
根据所述中间文件每个数据块的数据指纹, 在所述中间文件的哈希表 的数据指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进 行字节比较, 若比较一致, 则删除所述数据块。 上述程序 A, 优选的是计算所述待处理文件中各数据块的数据指纹之 前, 还包括: 对各数据块进行压缩。
本领域普通技术人员可以理解: 实现上述各方法实施例的全部或部分 步骤可以通过程序指令相关的硬件来完成。 前述的程序可以存储于一计算 机可读取存储介质中。 该程序在执行时, 执行包括上述各方法实施例的步 骤; 而前述的存储介质包括: ROM、 RAM, 磁碟或者光盘等各种可以存 储程序代码的介质。
最后应说明的是: 以上各实施例仅用以说明本发明的技术方案, 而非 对其限制; 尽管参照前述各实施例对本发明进行了详细的说明, 本领域的 普通技术人员应当理解: 其依然可以对前述各实施例所记载的技术方案进 行修改, 或者对其中部分或者全部技术特征进行等同替换; 而这些修改或 者替换, 并不使相应技术方案的本质脱离本发明各实施例技术方案的范 围。

Claims

权 利 要 求 书
1、 一种重复数据删除方法, 其特征在于, 包括:
将待处理文件划分成至少两个数据块;
计算所述待处理文件中各数据块的数据指纹;
根据各数据块的数据指纹和热点哈希表中的数据指纹对所述待处理 文件的数据块进行去重操作, 其中, 所述热点哈希表中的数据指纹为在至 少一个文件中重复出现次数达到设定门限值的数据指纹。
2、 根据权利要求 1所述的重复数据删除方法, 其特征在于, 在计算 所述待处理文件中各数据块的数据指纹之后, 还包括:
根据所述待处理文件中数据指纹重复出现的次数, 更新所述热点哈希 表。
3、 根据权利要求 2所述的重复数据删除方法, 其特征在于, 根据所 述待处理文件中数据指纹重复出现的次数, 更新所述热点哈希表包括: 在计算所述待处理文件中各数据块的数据指纹之后, 或在计算每个数 据块的数据指纹之后, 统计各数据指纹的出现次数;
将出现次数达到设定门限值的数据指纹写入热点哈希表中。
4、 根据权利要求 1所述的重复数据删除方法, 其特征在于: 将待处理文件划分成至少两个数据块包括: 将待处理文件划分成至少 两个中间文件, 将每个中间文件划分成至少两个数据块;
则计算所述待处理文件中各数据块的数据指纹之后, 还包括: 根据每个所述中间文件中各数据块的数据指纹, 更新所述中间文件对 应的哈希表; 在所述中间文件的去重处理完成后, 清空所述中间文件对应的哈希 表。
5、 根据权利要求 4所述的重复数据删除方法, 其特征在于, 根据各数据块的数据指纹和热点哈希表中的数据指纹对所述待处理文件的数据块进行去重操作包括: 根据所述中间文件每个数据块的数据指纹, 在所述热点哈希表的数据指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进行字节比较, 若比较一致, 则删除所述数据块; 当匹配不一致时, 根据所述中间文件的哈希表对所述中间文件进行去重操作。
6、 根据权利要求 4或 5所述的重复数据删除方法, 其特征在于, 根据所述中间文件的哈希表对所述中间文件进行去重操作包括: 根据所述中间文件每个数据块的数据指纹, 在所述中间文件的哈希表的数据指纹中进行匹配;
当匹配一致时, 对所述数据块与匹配一致数据指纹所对应的数据块进 行字节比较, 若比较一致, 则删除所述数据块。
7、根据权利要求 1-5任一所述的重复数据删除方法, 其特征在于, 计 算所述待处理文件中各数据块的数据指纹之前, 还包括: 对各数据块进行 压缩。
8、 一种重复数据删除装置, 其特征在于, 包括:
数据块划分模块, 用于将待处理文件划分成至少两个数据块; 计算模块, 用于计算所述待处理文件中各数据块的数据指纹; 第一去重模块, 用于根据各数据块的数据指纹和热点哈希表中的数据 指纹对所述待处理文件的数据块进行去重操作, 其中, 所述热点哈希表中 的数据指纹为在至少一个文件中重复出现次数达到设定门限值的数据指 纹。
9、 根据权利要求 8所述的装置, 其特征在于, 还包括:
热点哈希表更新模块, 用于在计算所述待处理文件中各数据块的数据 指纹之后, 根据所述待处理文件中数据指纹重复出现的次数, 更新所述热 点哈希表。
10、 根据权利要求 9所述的装置, 其特征在于, 所述热点哈希表更新 模块包括:
统计单元, 用于在计算所述待处理文件中各数据块的数据指纹之后, 或在计算每个数据块的数据指纹之后, 统计各数据指纹的出现次数;
写入单元, 用于将出现次数达到设定门限值的数据指纹写入热点哈希 表中。
11、 根据权利要求 8所述的重复数据删除装置, 其特征在于: 所述数据块划分模块具体用于将待处理文件划分成至少两个中间文件, 将每个中间文件划分成至少两个数据块;
所述装置还包括:
更新模块, 用于在计算所述待处理文件中各数据块的数据指纹之后, 根据每个所述中间文件中各数据块的数据指纹, 更新所述中间文件对应的哈希表;
第二去重模块, 用于根据所述中间文件的哈希表对所述中间文件进行去重操作;
清空模块, 用于在所述中间文件的去重处理完成后, 清空所述中间文 件对应的哈希表。
12、 根据权利要求 11所述的重复数据删除装置, 其特征在于, 所述第一去重模块包括: 第一匹配单元, 用于在根据所述中间文件的哈希表对所述中间文件进行去重操作之前, 根据所述中间文件每个数据块的数据指纹, 在所述热点哈希表的数据指纹中进行匹配;
第一删除单元, 用于当匹配一致时, 对所述数据块与匹配一致数据指 纹所对应的数据块进行字节比较, 若比较一致, 则删除所述数据块;
触发单元, 用于当匹配不一致时, 触发所述第二去重模块根据所述中 间文件的哈希表对所述中间文件进行的去重操作。
13、 根据权利要求 11或 12所述的重复数据删除装置, 其特征在于, 所述第二去重模块包括:
第二匹配单元, 用于根据所述中间文件每个数据块的数据指纹, 在所 述中间文件的哈希表的数据指纹中进行匹配;
比较删除单元, 用于当匹配一致时, 对所述数据块与匹配一致数据指 纹所对应的数据块进行字节比较, 若比较一致, 则删除所述数据块。
14、 根据权利要求 8或 12所述的重复数据删除装置, 其特征在于, 还包括:
压缩模块, 用于在计算所述待处理文件中各数据块的数据指纹之前, 对各数据块进行压缩。
PCT/CN2013/084542 2012-12-18 2013-09-27 重复数据删除方法和装置 WO2014094479A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210552244.1A CN103870514B (zh) 2012-12-18 2012-12-18 重复数据删除方法和装置
CN201210552244.1 2012-12-18

Publications (1)

Publication Number Publication Date
WO2014094479A1 true WO2014094479A1 (zh) 2014-06-26

Family

ID=50909055

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/084542 WO2014094479A1 (zh) 2012-12-18 2013-09-27 重复数据删除方法和装置

Country Status (2)

Country Link
CN (1) CN103870514B (zh)
WO (1) WO2014094479A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241615A (zh) * 2016-12-23 2018-07-03 中国电信股份有限公司 数据去重方法和装置

Families Citing this family (17)

Publication number Priority date Publication date Assignee Title
CN104077380B (zh) * 2014-06-26 2017-07-18 深圳信息职业技术学院 一种重复数据删除方法、装置及系统
CN104317823B (zh) * 2014-09-30 2016-03-16 北京艾秀信安科技有限公司 一种利用数据指纹进行数据检测的方法
CN104407982B (zh) * 2014-11-19 2018-09-21 湖南国科微电子股份有限公司 一种ssd盘片垃圾回收方法
US20160253096A1 (en) * 2015-02-28 2016-09-01 Altera Corporation Methods and apparatus for two-dimensional block bit-stream compression and decompression
CN106610790B (zh) * 2015-10-26 2020-01-03 华为技术有限公司 一种重复数据删除方法及装置
CN105488144A (zh) * 2015-11-25 2016-04-13 四川诚品电子商务有限公司 商品评论信息中重复信息处理方法
CN108228083A (zh) * 2016-12-21 2018-06-29 伊姆西Ip控股有限责任公司 用于数据去重的方法和设备
CN106990914B (zh) * 2017-02-17 2020-06-12 北京同有飞骥科技股份有限公司 数据删除方法及装置
CN107391034B (zh) * 2017-07-07 2019-05-10 华中科技大学 一种基于局部性优化的重复数据检测方法
US10671306B2 (en) * 2018-06-06 2020-06-02 Yingquan Wu Chunk-based data deduplication
CN108984123A (zh) * 2018-07-12 2018-12-11 郑州云海信息技术有限公司 一种重复数据删除方法和装置
CN111198857A (zh) * 2018-10-31 2020-05-26 深信服科技股份有限公司 一种基于全闪存阵列的数据压缩方法及系统
CN109885574B (zh) * 2019-02-22 2020-05-05 广州荔支网络技术有限公司 一种数据查询方法及装置
CN110109617B (zh) * 2019-04-22 2020-05-12 电子科技大学 一种加密重复数据删除系统中的高效元数据管理方法
CN110096483B (zh) * 2019-05-08 2021-04-30 北京奇艺世纪科技有限公司 一种重复文件检测方法、终端和服务器
CN110618789B (zh) * 2019-08-14 2021-08-20 华为技术有限公司 一种重复数据的删除方法及装置
CN112559452B (zh) * 2020-12-11 2021-12-17 北京云宽志业网络技术有限公司 数据去重处理方法、装置、设备及存储介质

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101079034A (zh) * 2006-07-10 2007-11-28 腾讯科技(深圳)有限公司 消除文件存储系统中冗余文件的系统及方法
US20100235333A1 (en) * 2009-03-16 2010-09-16 International Business Machines Corporation Apparatus and method to sequentially deduplicate data
CN102629247A (zh) * 2011-12-31 2012-08-08 成都市华为赛门铁克科技有限公司 一种数据处理方法、装置和系统
CN102741800A (zh) * 2009-09-18 2012-10-17 株式会社日立制作所 删除复制数据的存储系统

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN102385554B (zh) * 2011-10-28 2014-01-15 华中科技大学 重复数据删除系统的优化方法

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN101079034A (zh) * 2006-07-10 2007-11-28 腾讯科技(深圳)有限公司 消除文件存储系统中冗余文件的系统及方法
US20100235333A1 (en) * 2009-03-16 2010-09-16 International Business Machines Corporation Apparatus and method to sequentially deduplicate data
CN102741800A (zh) * 2009-09-18 2012-10-17 株式会社日立制作所 删除复制数据的存储系统
CN102629247A (zh) * 2011-12-31 2012-08-08 成都市华为赛门铁克科技有限公司 一种数据处理方法、装置和系统

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN108241615A (zh) * 2016-12-23 2018-07-03 中国电信股份有限公司 数据去重方法和装置

Also Published As

Publication number Publication date
CN103870514A (zh) 2014-06-18
CN103870514B (zh) 2018-03-09

Similar Documents

Publication Publication Date Title
WO2014094479A1 (zh) 重复数据删除方法和装置
CN108427538B (zh) 全闪存阵列的存储数据压缩方法、装置、及可读存储介质
CN108427539B (zh) 缓存设备数据的离线去重压缩方法、装置及可读存储介质
CN103098035B (zh) 存储系统
US11334255B2 (en) Method and device for data replication
US8782011B2 (en) System and method for scalable reference management in a deduplication based storage system
EP3376393B1 (en) Data storage method and apparatus
US11232073B2 (en) Method and apparatus for file compaction in key-value store system
JP6537214B2 (ja) 重複排除方法および記憶デバイス
WO2013127309A1 (zh) 数据处理方法及数据处理设备
CN111125033B (zh) 一种基于全闪存阵列的空间回收方法及系统
WO2013086969A1 (zh) 重复数据查找方法、装置及系统
CN110998537B (zh) 一种过期备份处理方法及备份服务器
CN105912268B (zh) 一种基于自匹配特征的分布式重复数据删除方法及其装置
WO2014067063A1 (zh) 重复数据检索方法及设备
WO2014201696A1 (zh) 一种文件读取方法、存储设备及读取系统
WO2013163813A1 (zh) 重复数据删除方法及装置
CN112612576B (zh) 虚拟机备份方法、装置、电子设备及存储介质
US10346256B1 (en) Client side cache for deduplication backup systems
US8909606B2 (en) Data block compression using coalescion
WO2015096847A1 (en) Method and apparatus for context aware based data de-duplication
CN111124940B (zh) 一种基于全闪存阵列的空间回收方法及系统
EP3432168B1 (en) Metadata separated container format
CN111124939A (zh) 一种基于全闪存阵列的数据压缩方法及系统
CN111124259A (zh) 一种基于全闪存阵列的数据压缩方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13865396

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13865396

Country of ref document: EP

Kind code of ref document: A1