WO2014067063A1 - Method and device for duplicate data retrieval - Google Patents

Info

Publication number
WO2014067063A1
WO2014067063A1 PCT/CN2012/083740 CN2012083740W WO2014067063A1 WO 2014067063 A1 WO2014067063 A1 WO 2014067063A1 CN 2012083740 W CN2012083740 W CN 2012083740W WO 2014067063 A1 WO2014067063 A1 WO 2014067063A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
hash value
data packet
block
hash
Prior art date
Application number
PCT/CN2012/083740
Other languages
English (en)
Chinese (zh)
Inventor
覃强
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2012/083740 (WO2014067063A1)
Priority to CN201280001989.7A (CN103189867B)
Publication of WO2014067063A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/1748: De-duplication implemented within the file system, e.g. based on file segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/14: Details of searching files based on file metadata
    • G06F16/148: File search processing
    • G06F16/152: File search processing using file content signatures, e.g. hash values

Definitions

  • The present invention relates to storage technologies, and in particular to a method and device for duplicate data retrieval.
  • Background
  • Deduplication (also written de-duplication) is a data reduction technology designed to reduce the storage capacity used in storage systems or the amount of data transmitted over a network. It is widely used in data backup and WAN data transmission scenarios.
  • The process of deduplication is as follows: the input data is divided into blocks, the hash value of each block is calculated, and the calculated hash value is looked up in the single-instance library to determine whether the block is a duplicate block. If the block is a duplicate block, the block and its hash value are not stored in the single-instance library again, which reduces the amount of data.
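The background process above can be illustrated with a minimal sketch. It is not taken from the patent: the fixed block size, the use of SHA-1, the in-memory `dict` standing in for the single-instance library, and the names `deduplicate`, `library`, and `recipe` are all assumptions made for illustration.

```python
import hashlib


def deduplicate(data: bytes, block_size: int, library: dict) -> list:
    """Split data into fixed-size blocks and store only unseen blocks.

    `library` plays the role of the single-instance library: it maps a
    block hash to the block itself. The returned list of hashes (the
    "recipe") is enough to rebuild the original data from the library.
    """
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha1(block).hexdigest()  # hash choice is an assumption
        if digest not in library:
            # Not a duplicate: keep the block and its hash.
            library[digest] = block
        recipe.append(digest)
    return recipe


library = {}
recipe = deduplicate(b"abcdabcdwxyz", 4, library)
restored = b"".join(library[digest] for digest in recipe)
```

Here the second `abcd` block is recognised as a duplicate, so the library holds two blocks for three recipe entries. Every block costs one lookup in the library, which is the per-block query cost the invention aims to reduce.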
  • Embodiments of the invention provide a method and device for duplicate data retrieval that improve the efficiency of duplicate-block queries and the overall performance of data deduplication.
  • The first aspect provides a method for duplicate data retrieval, including:
  • for a first data packet in the at least one data packet, performing a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtaining, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between the hash value of a second data packet already stored in the data storage space and the second data packet;
  • the hash value of the second data packet is obtained by performing a similarity hash operation on the data blocks in the second data packet;
  • the first data packet is any one of the at least one data packet; and if the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, performing duplicate-block retrieval on the data blocks in the first data packet.
  • The method for duplicate data retrieval further includes: if the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, storing the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and storing the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
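The two-threshold decision described in the first aspect can be sketched as follows. This is only one possible reading: the similarity measure (fraction of agreeing bits), picking the most similar candidate as the first hash value, and the names `similarity` and `classify_packet` are assumptions, not details given in the text.

```python
def similarity(a: int, b: int, width: int) -> float:
    """Fraction of bit positions at which a and b agree (assumed metric)."""
    return (width - bin(a ^ b).count("1")) / width


def classify_packet(packet_hash, packet_id, table, width,
                    first_threshold, second_threshold):
    """Decide between duplicate-block retrieval and storing a new packet.

    `table` stands in for the hash value storage table and maps a stored
    packet hash to its packet identifier.
    """
    candidates = [h for h in table
                  if similarity(packet_hash, h, width) >= first_threshold]
    if candidates:
        # Assumption: use the most similar candidate as the first hash value.
        first_hash = max(candidates,
                         key=lambda h: similarity(packet_hash, h, width))
        if similarity(packet_hash, first_hash, width) >= second_threshold:
            return "retrieve", first_hash
    # Similarity too low: treat the packet as new and record its hash.
    table[packet_hash] = packet_id
    return "store", None


table = {0b11110000: "p0"}
near = classify_packet(0b11110001, "p1", table, 8, 0.5, 0.75)
middle = classify_packet(0b11000011, "p2", table, 8, 0.5, 0.75)
```

In this example `near` triggers retrieval against the stored packet, while `middle` passes the first threshold but not the second and is stored as a new packet.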
  • Grouping the at least two data blocks to obtain the at least one data packet includes: forming to-be-blocked hash data from the hash values of the at least two data blocks; performing block processing on the to-be-blocked hash data with a blocking algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value block; and taking the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
  • Performing the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: hashing each data block in the first data packet to obtain the hash value of each data block; replacing each 0 in the hash value of each data block with -1; adding the corresponding bits of the hash values of all data blocks in the first data packet; mapping each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0; and using the resulting binary value as the hash value of the first data packet.
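The similarity hash operation described above can be sketched directly from its steps. The helper `sha1_bits` (SHA-1 as the per-block hash, 160 bits) is an assumption for illustration; `packet_hash` works on any equal-length rows of 0/1 bits.

```python
import hashlib


def sha1_bits(block: bytes) -> list:
    """Return the 160 bits of the block's SHA-1 hash, most significant first.

    SHA-1 is an assumed choice; the text does not name a hash function.
    """
    value = int.from_bytes(hashlib.sha1(block).digest(), "big")
    return [(value >> shift) & 1 for shift in range(159, -1, -1)]


def packet_hash(bit_rows: list) -> list:
    """Combine per-block hash bits into one packet-level similarity hash.

    Each 0 counts as -1 and each 1 as +1; the columns are summed, and a
    column maps to 1 only if its sum is greater than 0.
    """
    sums = [0] * len(bit_rows[0])
    for bits in bit_rows:
        for position, bit in enumerate(bits):
            sums[position] += 1 if bit == 1 else -1
    return [1 if total > 0 else 0 for total in sums]


# Column sums are 1, -1, 1, -1, so the packet hash is 1, 0, 1, 0.
example = packet_hash([[1, 0, 1, 1],
                       [1, 1, 0, 0],
                       [0, 0, 1, 0]])
same = packet_hash([sha1_bits(b"a")] * 3)
```

Because each output bit is a majority vote over the blocks, two packets that share most of their blocks tend to produce hash values that agree in most positions.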
  • The data storage space includes a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • Performing duplicate-block retrieval on the data blocks in the first data packet includes: obtaining, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and loading the data blocks in the storage area numbered n and their hash values into memory, where n is an integer greater than or equal to 0; and comparing each data block in the first data packet with the data block that has the same hash value in the storage area numbered n, to complete duplicate-block retrieval of the data blocks in the first data packet.
  • The method further includes: when the data blocks in the storage area numbered n and their hash values are loaded into memory, also loading the data blocks in the storage area numbered (n+1) and their hash values into memory.
  • In that case, the comparison includes: comparing each data block in the first data packet with the data blocks that have the same hash value in the storage areas numbered n and (n+1), to complete duplicate-block retrieval of the data blocks in the first data packet.
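A minimal sketch of loading storage area n, optionally together with area (n+1), and comparing a packet's blocks against the loaded content. The in-memory layout (a `dict` from area number to a `dict` of block hash to block) and the names `load_areas` and `mark_duplicates` are assumptions; a real system would read the areas from disk.

```python
def load_areas(storage_areas: dict, n: int, prefetch_next: bool = True) -> dict:
    """Load the blocks and block hashes of area n (and n+1) into memory."""
    numbers = (n, n + 1) if prefetch_next else (n,)
    loaded = {}
    for number in numbers:
        loaded.update(storage_areas.get(number, {}))
    return loaded


def mark_duplicates(packet_blocks: list, loaded: dict) -> list:
    """Flag each (hash, block) pair whose hash and content match a loaded block."""
    return [loaded.get(block_hash) == block
            for block_hash, block in packet_blocks]


areas = {0: {"h1": b"aaaa"}, 1: {"h2": b"bbbb"}, 2: {"h3": b"cccc"}}
packet = [("h1", b"aaaa"), ("h2", b"bbbb"), ("h3", b"cccc"), ("h9", b"zzzz")]
with_prefetch = mark_duplicates(packet, load_areas(areas, 0))
without_prefetch = mark_duplicates(packet, load_areas(areas, 0, prefetch_next=False))
```

Loading area (n+1) alongside area n lets a packet whose duplicates straddle two adjacent areas be resolved with a single load step, at the cost of more memory per retrieval.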
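  • Obtaining, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold includes: obtaining, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at corresponding positions of the hash value of the first data packet is greater than or equal to a preset number.
  • Obtaining such a hash value includes: obtaining the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and using as the first hash value a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.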
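Counting identical bits at corresponding positions and thresholding the Hamming distance are two views of the same test: for hash values of width w, the number of identical bits equals w minus the Hamming distance. The sketch below assumes hash values held as Python integers; the function names are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which a and b differ."""
    return bin(a ^ b).count("1")


def select_first_hashes(packet_hash: int, stored_hashes: list,
                        max_distance: int) -> list:
    """Return stored hashes within max_distance of the packet hash."""
    return [h for h in stored_hashes
            if hamming_distance(packet_hash, h) <= max_distance]


def identical_bits(a: int, b: int, width: int) -> int:
    """Number of positions (out of `width`) at which a and b agree."""
    return width - hamming_distance(a, b)


selected = select_first_hashes(0b1010, [0b1000, 0b0101, 0b1011], max_distance=1)
```

With 4-bit values, a Hamming distance threshold of 1 is the same condition as requiring at least 3 identical bits.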
  • The second aspect provides a duplicate data retrieval device, including:
  • a block obtaining module, configured to perform block processing on received data to obtain at least two data blocks;
  • a packet obtaining module, configured to group the at least two data blocks obtained by the block obtaining module to obtain at least one data packet, where each data packet includes at least one data block;
  • a hash calculation module, configured to, for a first data packet in the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold;
  • the hash value storage table stores a correspondence between the hash value of a second data packet already stored in the data storage space and the second data packet; the hash value of the second data packet is obtained by performing a similarity hash operation on the data blocks in the second data packet; the first data packet is any one of the at least one data packet; and
  • a duplicate retrieval module, configured to perform duplicate-block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
  • The duplicate data retrieval device further includes: a storage module, configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
  • The packet obtaining module is specifically configured to form to-be-blocked hash data from the hash values of the at least two data blocks; to perform block processing on the to-be-blocked hash data with a blocking algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value block; and to take the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
  • To perform the similarity hash operation on the data blocks in the first data packet and obtain the hash value of the first data packet, the hash calculation module is specifically configured to hash each data block in the first data packet to obtain the hash value of each data block, replace each 0 in the hash value of each data block with -1, add the corresponding bits of the hash values of all data blocks in the first data packet, map each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0, and use the resulting binary value as the hash value of the first data packet.
  • The data storage space includes a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • The duplicate retrieval module is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n and their hash values into memory, where n is an integer greater than or equal to 0; and to compare each data block in the first data packet with the data block that has the same hash value in the storage area numbered n, to complete duplicate-block retrieval of the data blocks in the first data packet.
  • The duplicate retrieval module is further configured to, when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory.
  • In that case, the duplicate retrieval module is specifically configured to compare each data block in the first data packet with the data blocks that have the same hash value in the storage areas numbered n and (n+1), to complete duplicate-block retrieval of the data blocks in the first data packet.
  • To obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold, the hash calculation module is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at corresponding positions of the hash value of the first data packet is greater than or equal to a preset number.
  • To do so, the hash calculation module is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use as the first hash value a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
  • A third aspect provides a duplicate data retrieval device, including a processor, a communication interface, a memory, and a bus, where the processor, the communication interface, and the memory communicate with each other through the bus;
  • the communication interface is configured to receive data;
  • the processor is configured to execute a program;
  • the memory is configured to store the program;
  • the program is configured to perform block processing on the data received by the communication interface to obtain at least two data blocks, and to group the at least two data blocks to obtain at least one data packet, where each data packet includes at least one data block;
  • for a first data packet in the at least one data packet, the program performs a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtains, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold; when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, the program performs duplicate-block retrieval on the data blocks in the first data packet.
  • The program is further configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
  • To group the at least two data blocks and obtain the at least one data packet, the program is specifically configured to form to-be-blocked hash data from the hash values of the at least two data blocks; to perform block processing on the to-be-blocked hash data with a blocking algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value block; and to take the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
  • To perform the similarity hash operation on the data blocks in the first data packet and obtain the hash value of the first data packet, the program is specifically configured to hash each data block in the first data packet to obtain the hash value of each data block, replace each 0 in the hash value of each data block with -1, add the corresponding bits of the hash values of all data blocks in the first data packet, map each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0, and use the resulting binary value as the hash value of the first data packet.
  • The data storage space includes a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • To perform duplicate-block retrieval on the data blocks in the first data packet, the program is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n and their hash values into memory, where n is an integer greater than or equal to 0; and to compare each data block in the first data packet with the data block that has the same hash value in the storage area numbered n, to complete duplicate-block retrieval of the data blocks in the first data packet.
  • The program is further configured to, when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory.
  • In that case, the program is specifically configured to compare each data block in the first data packet with the data blocks that have the same hash value in the storage areas numbered n and (n+1), to complete duplicate-block retrieval of the data blocks in the first data packet.
  • To obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold, the program is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at corresponding positions of the hash value of the first data packet is greater than or equal to a preset number.
  • In the seventh possible implementation of the third aspect, to do so, the program is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use as the first hash value a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
  • A fourth aspect provides a computer program product including a computer-readable storage medium for storing a program, the program including:
  • a block obtaining unit, configured to perform block processing on received data to obtain at least two data blocks;
  • a packet obtaining unit, configured to group the at least two data blocks obtained by the block obtaining unit to obtain at least one data packet, where each data packet includes at least one data block;
  • a hash calculation unit, configured to, for a first data packet in the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between the hash value of a second data packet already stored in the data storage space and the second data packet;
  • the hash value of the second data packet is obtained by performing a similarity hash operation on the data blocks in the second data packet;
  • the first data packet is any one of the at least one data packet; and
  • a duplicate retrieval unit, configured to perform duplicate-block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
  • The program further includes: a storage unit, configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
  • The packet obtaining unit is specifically configured to form to-be-blocked hash data from the hash values of the at least two data blocks; to perform block processing on the to-be-blocked hash data with a blocking algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value block; and to take the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
  • To perform the similarity hash operation on the data blocks in the first data packet and obtain the hash value of the first data packet, the hash calculation unit is specifically configured to hash each data block in the first data packet to obtain the hash value of each data block, replace each 0 in the hash value of each data block with -1, add the corresponding bits of the hash values of all data blocks in the first data packet, map each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0, and use the resulting binary value as the hash value of the first data packet.
  • The data storage space includes a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • The duplicate retrieval unit is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n and their hash values into memory, where n is an integer greater than or equal to 0; and to compare each data block in the first data packet with the data block that has the same hash value in the storage area numbered n, to complete duplicate-block retrieval of the data blocks in the first data packet.
  • The duplicate retrieval unit is further configured to, when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory.
  • In that case, the duplicate retrieval unit is specifically configured to compare each data block in the first data packet with the data blocks that have the same hash value in the storage areas numbered n and (n+1), to complete duplicate-block retrieval of the data blocks in the first data packet.
  • To obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold, the hash calculation unit is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at corresponding positions of the hash value of the first data packet is greater than or equal to a preset number.
  • To do so, the hash calculation unit is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use as the first hash value a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
  • The method and device for duplicate data retrieval first divide the received data into blocks and then group the blocks into data packets. A similarity hash operation on the data blocks in a data packet yields the hash value of the data packet. A first hash value, whose similarity to the hash value of the data packet is greater than or equal to a preset first similarity threshold, is then obtained from the hash values of the stored data packets in the hash value storage table. If the similarity between the hash value of the data packet and the first hash value is greater than or equal to a preset second similarity threshold, the data blocks in the data packet are, to a considerable extent, duplicate blocks, and duplicate-block retrieval is performed on them.
  • The hash value storage table that is queried stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets. Because the number of data packets is relatively small, querying the hash value storage table is efficient. Performing duplicate-block retrieval per data packet also reduces the number of retrievals, and therefore the number of interactions with the disk, which improves the efficiency of duplicate-block queries and the overall performance of deduplication.
  • FIG. 1 is a flowchart of a method for retrieving duplicate data according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a similarity hash operation process according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a repetitive data retrieval device according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a repetitive data retrieval device according to another embodiment of the present invention
  • FIG. FIG. 6 is a schematic structural diagram of a computer program product according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for retrieving duplicate data according to an embodiment of the present invention. As shown in FIG. 1, the method of this embodiment includes:
  • Step 101: Perform block processing on the received data to obtain at least two data blocks.
  • This embodiment may be executed by a duplicate data retrieval device. The device may be any device with computing capability, for example a server or computer in a data backup environment, or a terminal, gateway, or base station in a WAN data transmission scenario.
  • After receiving the data to be stored, the duplicate data retrieval device first divides the data into blocks to obtain at least two data blocks.
  • The duplicate data retrieval device may perform block processing using, for example but not limited to, a Fixed-Sized Partition (FSP) algorithm, a Content-Defined Chunking (CDC) algorithm, which produces variable-size blocks, or a sliding block algorithm.
  • The size of a data block depends on the blocking algorithm used and on the requirements of the application; the embodiments of the present invention do not limit the specific value.
  • Block processing with these blocking algorithms is prior art and is not described in detail here.
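For readers unfamiliar with content-defined chunking, the toy sketch below shows the general idea: a block boundary is placed wherever a fingerprint of the most recent bytes meets a condition, so boundaries follow content rather than fixed offsets. It is a simplified stand-in, not the algorithm used by the patent or by any particular product; the byte-sum fingerprint and every parameter value are assumptions (practical systems commonly use a rolling fingerprint such as Rabin fingerprinting).

```python
def content_defined_chunks(data: bytes, window: int = 4, divisor: int = 8,
                           min_size: int = 2, max_size: int = 16) -> list:
    """Split data at positions chosen by a fingerprint of recent bytes."""
    chunks = []
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue  # never cut before the minimum block size
        lo = max(start, i - window + 1)
        fingerprint = sum(data[lo:i + 1])  # toy fingerprint (assumption)
        if fingerprint % divisor == 0 or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder, may be short
    return chunks


sample = bytes(range(50))
chunks = content_defined_chunks(sample)
```

A fixed-size partition, by contrast, would simply slice `data` every `block_size` bytes, which is cheaper but shifts every later boundary when a single byte is inserted.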
  • Step 102: Group the at least two data blocks to obtain at least one data packet, where each data packet includes at least one data block.
  • After dividing the data into data blocks, the duplicate data retrieval device groups the data blocks into data packets. The number of data packets may be smaller than the number of data blocks.
  • Grouping assigns the data blocks to different data packets, and it can be done in several ways.
  • For example, the duplicate data retrieval device may assign the data blocks in order so that every data packet contains the same number of data blocks, forming at least one data packet.
  • The duplicate data retrieval device may also apply a blocking algorithm to the data blocks to obtain at least one data packet.
  • Specifically, this includes: forming to-be-blocked hash data from the hash values of the at least two data blocks; and, using the length of the hash value of any one data block (all block hash values have the same length) as the sliding step, performing block processing on the to-be-blocked hash data with the blocking algorithm to obtain at least one hash value block.
  • The sliding step is the minimum distance moved in one slide over the to-be-blocked hash data; a hash value block produced by the blocking algorithm is obtained after one or more slides.
  • Each hash value block therefore consists of one or more complete hash values. If the sliding distance for a hash value block is several sliding steps (that is, several slides), the hash value block consists of several hash values. If the sliding distance is one sliding step (that is, one slide), the hash value block consists of a single hash value.
  • The data blocks corresponding to the hash values in the same hash value block are assigned to one data packet, which yields at least one data packet. With this grouping, the end of every data packet coincides with the end of a data block, so the packets are divided more accurately.
  • Performing block processing on the to-be-blocked hash data with the blocking algorithm is similar to the existing blocking process and is not described again.
  • Forming the to-be-blocked hash data from the hash values of the at least two data blocks includes: calculating the hash value of each of the at least two data blocks and concatenating these hash values to form the to-be-blocked hash data.
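The grouping described in Step 102 can be sketched as follows. The block hashes are concatenated, the data is scanned one whole hash at a time (the sliding step), and a boundary rule decides where a hash value block ends. The boundary rule used here (last byte of the current hash divisible by `divisor`, or a cap of `max_blocks`), the use of SHA-1, and the function name `group_blocks` are assumptions for illustration only.

```python
import hashlib


def group_blocks(blocks: list, divisor: int = 4, max_blocks: int = 8) -> list:
    """Group consecutive data blocks into packets by chunking their hashes."""
    if not blocks:
        return []
    hashes = [hashlib.sha1(block).digest() for block in blocks]
    step = len(hashes[0])          # sliding step = length of one block hash
    hash_data = b"".join(hashes)   # the to-be-blocked hash data
    packets, current = [], []
    for index, pos in enumerate(range(0, len(hash_data), step)):
        window = hash_data[pos:pos + step]  # exactly one complete hash value
        current.append(blocks[index])
        # Assumed boundary rule: cut after this hash if its last byte is
        # divisible by `divisor`, or if the packet has reached its cap.
        if window[-1] % divisor == 0 or len(current) >= max_blocks:
            packets.append(current)
            current = []
    if current:
        packets.append(current)
    return packets


blocks = [bytes([i]) * 4 for i in range(20)]
packets = group_blocks(blocks)
```

Because the scan only ever advances by one full hash, a packet boundary can never fall inside a hash value, which is why every packet ends exactly at the end of a data block.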
  • Each data packet is composed of consecutive data blocks.
  • The number of data blocks in each data packet may be the same or different. The number may be chosen according to the application; the embodiments of the present invention do not limit the specific value.
  • Performing duplicate-block retrieval per data packet reduces the number of retrievals and the interaction with the disk, and so improves the efficiency of duplicate-block retrieval.
  • Step 103: For a first data packet in the at least one data packet, perform a similarity hash (also called similar hash, or simhash) operation on the data blocks in the first data packet to obtain a hash value of the first data packet.
  • If the similarity of the hash values is greater than or equal to the preset second similarity threshold, repeated block retrieval is performed on the data blocks in the first data packet.
  • This embodiment is described by taking any one of the data packets as an example, which is referred to as the first data packet for ease of distinction; that is, the first data packet may be any one of the at least one data packet obtained above.
  • the hash value storage table stores a correspondence between the hash value of the second data packet currently stored in the data storage space and the second data packet. For ease of differentiation and description, the data packets that have been currently stored in the data storage space are recorded as the second data packet.
  • The hash value of a second data packet stored in the hash value storage table is calculated in the same way as the hash value of the first data packet in this embodiment; that is, the hash value of the second data packet is also obtained by performing the similarity hash operation on the data blocks in the second data packet. The data blocks corresponding to the stored hash values do not overlap each other; that is, the data blocks in the second data packets have been determined not to be duplicate blocks.
  • the data storage space refers to the storage space for storing data chunks, which may be a hard disk, a disk, or the like.
  • The hash value storage table in this embodiment is much smaller, so it can be stored in memory, which helps improve the efficiency of querying the hash value storage table and thus further improves the efficiency of repeated block retrieval.
  • the hash value storage table is not limited to being stored in the memory, and may be stored on a disk or other storage device, but is preferably stored in the memory. After the data retrieval device obtains the data packet, the same processing is performed for each data packet. In this embodiment, the first data packet is taken as an example, and the repeated data retrieval device performs the following processing on the first data packet:
  • a similarity hash operation is performed on the data partition in the first data packet to obtain a hash value of the first data packet.
  • the principle of similarity hashing is that the higher the similarity between two data chunks, the greater the similarity of the calculated hash values, and vice versa.
  • the similarity hash operation is an arithmetic method capable of making the similarity of the hash values of the data blocks having higher similarity higher.
  • One method for performing the similarity hash operation on the first data packet includes: hashing each data block in the first data packet to obtain a hash value of each data block; representing the hash value of each data block in binary form and converting each bit, where a bit with the value 0 is replaced by -1 and a bit with the value 1 remains unchanged; and then accumulating the converted hash values. Specifically, the corresponding bits of the converted hash values are added; each sum greater than 0 is mapped to 1 and each sum less than or equal to 0 is mapped to 0, and the resulting binary value is the hash value of the first data packet.
  • The first data packet includes n data blocks, namely the first data block through the nth data block, and each data block is hashed to obtain a hash value in binary form. FIG. 2 shows that the binary hash values of the first data block, the second data block, and the nth data block are 100110, 110000, and 001001, respectively. The 0 bits in the hash value of each data block are then replaced by -1.
  • After the replacement, the hash values of the first data block, the second data block, and the nth data block are (1, -1, -1, 1, 1, -1), (1, 1, -1, -1, -1, -1), and (-1, -1, 1, -1, -1, 1), respectively. Adding the corresponding bits of the converted hash values of the n data blocks gives the sums 13, 18, -22, -5, -2, and 5. Each sum greater than 0 is mapped to 1 and each sum less than or equal to 0 is mapped to 0, which yields the binary value 110001 as the hash value of the first data packet.
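The similarity hash operation on a data packet can be sketched in a few lines. This is an illustrative sketch: the function names `sums_to_bits` and `packet_simhash` are hypothetical, and block hash values are represented as equal-length binary strings purely for readability.

```python
from typing import List


def sums_to_bits(sums: List[int]) -> str:
    """Map each column sum to a bit: greater than 0 gives 1, otherwise 0."""
    return "".join("1" if s > 0 else "0" for s in sums)


def packet_simhash(block_hashes: List[str]) -> str:
    """Combine equal-length binary block hashes into one packet hash.

    Each 0 bit counts as -1 and each 1 bit as +1; the per-position sums
    are then mapped back to bits.
    """
    width = len(block_hashes[0])
    sums = [0] * width
    for h in block_hashes:
        if len(h) != width:
            raise ValueError("all block hashes must have the same length")
        for i, bit in enumerate(h):
            sums[i] += 1 if bit == "1" else -1
    return sums_to_bits(sums)


# The column sums from the example in the text map to 110001.
print(sums_to_bits([13, 18, -22, -5, -2, 5]))
# With only the three block hashes that are listed explicitly:
print(packet_simhash(["100110", "110000", "001001"]))
```

Note that the second call uses only the three hash values shown in the example, so its result differs from the 110001 obtained over all n data blocks.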
  • Another similarity hash operation, such as a perceptual hash algorithm, may also be used to perform the similarity hash operation on the data blocks in the first data packet in this embodiment.
  • The principle of the perceptual hash operation is to generate a "fingerprint" string for each picture and then compare the fingerprints of different pictures; the more similar the fingerprints, the more similar the pictures. Applied to the repeated data retrieval method provided in this embodiment, the principle is to calculate a hash value for each data packet and then compare the hash values of different data packets. The higher the similarity between two hash values, the more data blocks are likely to be duplicated between the two data packets (that is, the greater the similarity between the two data packets).
  • A high similarity indicates that the data blocks in the data packet are, to a large degree, duplicate blocks; performing repeated block retrieval in that case improves the performance of repeated block retrieval.
  • the method of the present embodiment will be described below in a comparative manner to improve the performance of repeated block retrieval.
  • A hash value in the hash value storage table whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold is obtained and recorded as the first hash value.
  • If multiple hash values have a similarity greater than or equal to the preset first similarity threshold, multiple hash values may be obtained, each of which is a first hash value; if only one hash value has a similarity greater than or equal to the preset first similarity threshold, that hash value is taken as the first hash value, that is, exactly one first hash value is obtained.
  • a hash value having the largest similarity to the hash value of the first data packet in the hash value storage table may be obtained as the first hash value, but is not limited thereto.
  • One way to obtain the hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold is as follows: the repeated data retrieval device obtains, from the hash value storage table, a hash value for which the number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to a preset number, and uses it as the first hash value.
  • The number of identical bits at corresponding positions of two hash values represents their similarity: the more bits that are identical at corresponding positions, the higher the similarity of the two hash values, and vice versa.
  • the preset number here is equivalent to the preset first similarity threshold.
  • One implementation in which the repeated data retrieval device obtains, from the hash value storage table, a hash value for which the number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to the preset number as the first hash value includes: the repeated data retrieval device obtains the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and uses each hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
  • other parameters capable of representing the similarity of the two hash values may be used.
  • the preset Hamming distance threshold here is equivalent to the above preset number.
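The Hamming-distance lookup can be sketched as below. This is a minimal sketch, assuming the hash value storage table is an in-memory dictionary keyed by packet hash (as a binary string) whose values are storage area numbers; the names `hamming_distance`, `find_first_hashes`, and `max_distance` are hypothetical.

```python
from typing import Dict, List


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("hash values must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_first_hashes(packet_hash: str, table: Dict[str, int],
                      max_distance: int) -> List[str]:
    """Return every stored hash within `max_distance` of `packet_hash`."""
    return [h for h in table
            if hamming_distance(packet_hash, h) <= max_distance]


# Assumed table layout: packet hash -> number of the storage area.
table = {"110001": 0, "110011": 1, "001110": 2}
print(find_first_hashes("110001", table, 1))  # ['110001', '110011']
```

For hash values of length w, the number of identical bits equals w minus the Hamming distance, which is why a maximum distance and a minimum number of identical bits express the same threshold.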
  • The repeated data retrieval device compares the similarity between the hash value of the first data packet and the first hash value with a preset second similarity threshold to determine whether repeated block retrieval needs to be performed on the first data packet. If the similarity is greater than or equal to the second similarity threshold, the degree of repetition between the first data packet and the second data packet corresponding to the first hash value is very high, and it can be determined that the two share many duplicate blocks, so repeated block retrieval needs to be performed on the first data packet.
  • The second similarity threshold may be a threshold on the number of identical bits.
  • In that case, comparing the similarity between the hash value of the first data packet and the first hash value with the preset second similarity threshold may be: the repeated data retrieval device determines whether the number of identical bits at corresponding positions of the hash value of the first data packet and the first hash value is greater than or equal to the preset threshold on the number of identical bits.
  • the second similarity threshold is greater than or equal to the first similarity threshold.
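The two-threshold decision can be sketched as follows, with both thresholds expressed as a number of identical bits. This is a sketch under assumptions: it picks the most similar stored hash as the first hash value (one of the options the text mentions), and the names `matching_bits` and `select_for_retrieval` are hypothetical.

```python
from typing import Iterable, Optional


def matching_bits(a: str, b: str) -> int:
    """Count identical bits at corresponding positions."""
    return sum(x == y for x, y in zip(a, b))


def select_for_retrieval(packet_hash: str, stored_hashes: Iterable[str],
                         first_threshold: int,
                         second_threshold: int) -> Optional[str]:
    """Return the first hash value if repeated block retrieval is warranted.

    Returns None when no stored hash reaches the first threshold, or when
    the best candidate does not reach the second threshold.
    """
    if second_threshold < first_threshold:
        raise ValueError("second threshold must be >= first threshold")
    candidates = [h for h in stored_hashes
                  if matching_bits(packet_hash, h) >= first_threshold]
    if not candidates:
        return None
    # Assumed policy: use the most similar candidate as the first hash value.
    best = max(candidates, key=lambda h: matching_bits(packet_hash, h))
    if matching_bits(packet_hash, best) >= second_threshold:
        return best
    return None
```

A return value of None corresponds to treating the packet as new data, while a returned hash identifies the stored packet to compare against.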
  • The data storage space includes multiple storage areas, each storage area has a number, and the storage areas are used in ascending order of their numbers.
  • From the correspondence between the hash value of a second data packet and the number of its storage area, stored in the hash value storage table, the storage area in which the second data packet is located can be determined.
  • The process of performing repeated block retrieval on the first data packet may be: the repeated data retrieval device obtains, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and loads the data blocks in the storage area numbered n, together with their hash values, into memory, where n is an integer greater than or equal to 0; it then compares the data blocks in the first data packet with the data blocks in the storage area numbered n that have the same hash value, to complete repeated block retrieval for the data blocks in the first data packet.
  • When the storage area numbered (n+1) is also loaded, the comparison may be: the data blocks in the first data packet are compared with the data blocks having the same hash value in the storage areas numbered n and (n+1), to complete repeated block retrieval for the data blocks in the first data packet.
  • Specifically, the hash value of each data block in the first data packet is first compared with the hash values in the storage areas numbered n and (n+1), to obtain the hash values in the first data packet that are identical to hash values in those storage areas. Each identical hash value obtained here is recorded as a second hash value. The data block corresponding to the second hash value in the first data packet is then compared with the data block corresponding to the second hash value in the storage areas numbered n and (n+1), to complete repeated block retrieval for the data blocks in the first data packet.
  • The storage area numbered (n+1) is the storage area that follows the storage area numbered n; that is, after the storage area numbered n is full, data continues to be written to the storage area numbered (n+1). Because data received next may duplicate data in the storage area following the one corresponding to the first hash value (that is, the storage area numbered (n+1)), loading the contents of both the storage area corresponding to the first hash value (that is, the storage area numbered n) and the next storage area into memory at one time helps improve the efficiency of subsequent repeated block retrieval, and thus the overall efficiency of repeated block retrieval.
  • A preferred partitioned storage mode is: data blocks are stored together in one storage area in the order in which they are received, and when that storage area is full, subsequently received data blocks are stored in the next storage area.
  • Each storage area is a storage space, and each storage area has a certain size, for example, but not limited to 64 MB.
  • The data blocks and the hash values of the data blocks are stored together in each storage area, and the specific storage manner is not limited.
  • A preferred layout of a storage area is as follows: the storage area is divided into two parts. One part is a data segment area, which stores the data blocks; the other part is a metadata area, which stores the metadata corresponding to the data blocks in the data segment area. The metadata includes the hash value of each data block, the length of the data block, the length of the data segment, a check code, and the like. The present invention mainly uses the hash value of the data block in the metadata.
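The retrieval against storage areas n and (n+1) can be sketched as below. This is a simplified model, not the patent's on-disk layout: each storage area is assumed to be a dictionary mapping a block hash to the block data (standing in for the metadata area and data segment area), SHA-1 is an assumed hash function, and the names `load_areas` and `find_duplicates` are hypothetical.

```python
import hashlib
from typing import Dict, List

# Assumed in-memory model of one storage area: block hash -> block data.
StorageArea = Dict[bytes, bytes]


def load_areas(areas: List[StorageArea], n: int) -> StorageArea:
    """Load area n and, if it exists, area n+1 into one in-memory index."""
    cache = dict(areas[n])
    if n + 1 < len(areas):
        cache.update(areas[n + 1])
    return cache


def find_duplicates(packet: List[bytes], cache: StorageArea) -> List[bool]:
    """Flag each block of the packet that duplicates a cached block."""
    flags = []
    for block in packet:
        stored = cache.get(hashlib.sha1(block).digest())
        # A matching hash (the "second hash value") is confirmed by
        # comparing the block data itself.
        flags.append(stored is not None and stored == block)
    return flags
```

Loading both areas up front means the per-block lookups touch only memory, which is the disk-interaction saving the text describes.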
  • If the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, the degree of repetition between the first data packet and the second data packet corresponding to the first hash value is not high, and it can be determined that the two share no duplicate blocks or very few; for example, only one or two data blocks in the first data packet may duplicate data blocks in the second data packet corresponding to the first hash value. To improve overall performance, the data blocks in the first data packet may be processed as new data, that is, stored directly into the data storage space without repeated block retrieval. Further, if the data storage space includes a plurality of storage areas, the repeated data retrieval device can directly store the data blocks in the first data packet and the hash values of those data blocks into the storage area currently in use.
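Storing a packet as new data can be sketched as follows. This is an illustrative model under assumptions: storage areas are in-memory lists of (hash, block) pairs, the hash value storage table maps a packet hash to the number of the area where the packet starts, SHA-1 is an assumed hash function, and the class and method names (`BlockStore`, `store_packet`) are hypothetical. The 64 MB default mirrors the example size given in the text.

```python
import hashlib
from typing import Dict, List, Tuple


class BlockStore:
    """Toy model of numbered storage areas that are filled in order."""

    def __init__(self, area_capacity: int = 64 * 1024 * 1024) -> None:
        self.area_capacity = area_capacity
        self.areas: List[List[Tuple[bytes, bytes]]] = [[]]
        self.used = 0  # bytes used in the current (last) area
        self.table: Dict[str, int] = {}  # packet hash -> area number

    def _append(self, block: bytes) -> int:
        # Move on to the next storage area when the current one is full.
        if self.used > 0 and self.used + len(block) > self.area_capacity:
            self.areas.append([])
            self.used = 0
        self.areas[-1].append((hashlib.sha1(block).digest(), block))
        self.used += len(block)
        return len(self.areas) - 1

    def store_packet(self, packet_hash: str, blocks: List[bytes]) -> int:
        """Store a non-empty packet as new data and record it in the table."""
        numbers = [self._append(block) for block in blocks]
        self.table[packet_hash] = numbers[0]
        return numbers[0]
```

Recording the area where the packet starts is one reasonable choice here, since a packet that spills over continues in area (n+1), which the retrieval step loads alongside area n.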
  • The repeated data retrieval method first divides the received data into blocks and groups the blocks, performs a similarity hash operation on the data blocks in a data packet to obtain a hash value of the data packet, and then obtains, from the hash values of the stored data packets recorded in the hash value storage table, a first hash value whose similarity to the hash value of the data packet is greater than or equal to a preset first similarity threshold. It then determines whether the similarity between the hash value of the data packet and the first hash value is greater than or equal to the preset second similarity threshold; if so, the data blocks in the data packet are largely duplicate blocks, and repeated block retrieval is performed. Because the hash value storage table stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets, and the number of data packets is relatively small, the efficiency of querying the hash value storage table is relatively high.
  • FIG. 3 is a schematic structural diagram of a duplicate data retrieval device according to an embodiment of the present invention.
  • In a specific implementation, the repeated data retrieval device in this embodiment may be any device having computing and storage capabilities, for example, a server or a computer in a data backup environment, or a terminal, a gateway, a base station, or the like in a WAN data transmission scenario. The embodiments of the present invention do not limit the specific implementation of the repeated data retrieval device.
  • the device in this embodiment includes: a block obtaining module 31, a group obtaining module 32, a hash calculating module 33, and a repeating search module 34.
  • the block obtaining module 31 is configured to perform block processing on the received data to obtain at least two data blocks.
  • the packet obtaining module 32 is connected to the block obtaining module 31 and configured to group at least two data blocks obtained by the block obtaining module 31 to obtain at least one data packet, and each data packet includes at least one data block.
  • The hash calculation module 33 is connected to the packet obtaining module 32 and is configured to: for a first data packet in the at least one data packet obtained by the packet obtaining module 32, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold.
  • The hash value storage table stores a correspondence between the hash value of a second data packet that has been stored in the data storage space and the second data packet, where the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet; the first data packet is any one of the at least one data packet.
  • The repeated search module 34 is connected to the hash calculation module 33 and is configured to perform repeated block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value obtained by the hash calculation module 33 is greater than or equal to a preset second similarity threshold.
  • The repeated data retrieval device of this embodiment further includes a storage module 35.
  • The storage module 35 is connected to the hash calculation module 33 and is configured to: when the similarity between the hash value of the first data packet and the first hash value obtained by the hash calculation module 33 is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • the hash calculation module 33 performs the same actions for each data packet.
  • The packet obtaining module 32 is specifically configured to: form the hash data to be blocked from the hash value of each of the at least two data blocks obtained by the block obtaining module 31; using the length of the hash value of each of the at least two data blocks as the sliding step size, perform block processing on the hash data to be blocked by using the blocking algorithm to obtain at least one hash value block; and use the data blocks corresponding to the hash values belonging to the same hash value block as one data packet, thereby obtaining at least one data packet.
  • That the hash calculation module 33 is configured to perform a similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: the hash calculation module 33 is specifically configured to hash each data block in the first data packet to obtain a hash value of each data block, replace each 0 in the hash value of each data block with -1, add the corresponding bits of the hash values of all data blocks in the first data packet, map each sum greater than 0 to 1 and each sum less than or equal to 0 to 0, and use the resulting binary value as the hash value of the first data packet.
  • the data storage space includes a plurality of storage areas.
  • the hash value storage table further stores a correspondence between a hash value of the second data packet and a number of the storage area where the second data packet is located.
  • The repeated retrieval module 34 is specifically configured to: obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n and the hash values of those data blocks into memory, where n is an integer greater than or equal to 0; and compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area numbered n, to complete repeated block retrieval for the data blocks in the first data packet.
  • The repeated retrieval module 34 is further configured to: when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory.
  • In that case, that the repeated retrieval module 34 is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area numbered n, to complete repeated block retrieval for the data blocks in the first data packet, includes: the repeated retrieval module 34 is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage areas numbered n and (n+1), to complete repeated block retrieval for the data blocks in the first data packet.
  • That the hash calculation module 33 is configured to obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold includes: the hash calculation module 33 is specifically configured to obtain, from the hash value storage table, a hash value for which the number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to a preset number, as the first hash value.
  • That the hash calculation module 33 is specifically configured to obtain such a hash value as the first hash value includes: the hash calculation module 33 is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and use each hash value in the hash value storage table whose Hamming distance is less than or equal to the preset Hamming distance threshold as the first hash value.
  • the functional modules of the repeated data retrieval device provided by the embodiments of the present invention can be used to execute the process of the repeated data retrieval method shown in FIG. 1.
  • the specific working principle is not described here. For details, refer to the description of the method embodiments.
  • The repeated data retrieval device provided in this embodiment first divides the received data into blocks and groups the blocks, performs a similarity hash operation on the data blocks in a data packet to obtain a hash value of the data packet, obtains, from the hash values of the stored data packets recorded in the hash value storage table, a first hash value whose similarity to the hash value of the data packet is greater than or equal to the preset first similarity threshold, and determines whether the similarity between the hash value of the data packet and the first hash value is greater than or equal to the preset second similarity threshold. Because the hash value storage table stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets, and the number of data packets is relatively small, querying the hash value storage table is efficient. Performing repeated block retrieval on the basis of data packets also reduces the number of repeated block searches, that is, the frequency of disk interaction, which helps improve the efficiency of duplicate block queries and thus the overall performance of deduplication.
  • FIG. 5 is a schematic structural diagram of a repeated data retrieval device according to another embodiment of the present invention.
  • In a specific implementation, the repeated data retrieval device of this embodiment may be any device having computing and storage capabilities, for example, a server or a computer in a data backup environment, or a terminal, a gateway, a base station, or the like in a WAN data transmission scenario. The embodiments of the present invention do not limit the specific implementation of the repeated data retrieval device.
  • The repeated data retrieval device of this embodiment includes a processor 51, a memory 52, a communication interface 53, and a bus.
  • The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in Figure 5, but it does not mean that there is only one bus or one type of bus.
  • The communication interface 53 is configured to receive data.
  • the processor 51 is configured to execute a program.
  • the program can include program code, the program code including computer operating instructions.
  • The processor 51 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
  • the memory 52 is used to store a program.
  • the memory 52 may include a high speed RAM memory, and may also include a non-volatile memory such as at least one disk memory.
  • The foregoing program may be specifically configured to: perform block processing on the data received by the communication interface 53 to obtain at least two data blocks; group the at least two data blocks to obtain at least one data packet, where each data packet includes at least one data block; for a first data packet in the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between the hash value of a second data packet that has been stored in the data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and if the similarity between the hash value of the first data packet and the first hash value is greater than or equal to the preset second similarity threshold, perform repeated block retrieval on the data blocks in the first data packet.
  • The program stored in the memory 52 is further configured to: when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • That the program stored in the memory 52 is configured to group the at least two data blocks to obtain at least one data packet includes: the program is specifically configured to form the hash data to be blocked from the hash value of each of the at least two data blocks; using the length of the hash value of any data block as the sliding step size, perform block processing on the hash data to be blocked by using the blocking algorithm to obtain at least one hash value block; and use the data blocks corresponding to the hash values belonging to the same hash value block as one data packet.
  • That the program stored in the memory 52 is configured to perform a similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: the program is specifically configured to hash each data block in the first data packet to obtain a hash value of each data block, replace each 0 in the hash value of each data block with -1, add the corresponding bits of the hash values of all data blocks in the first data packet, map each sum greater than 0 to 1 and each sum less than or equal to 0 to 0, and use the resulting binary value as the hash value of the first data packet.
  • the data storage space includes a plurality of storage areas; and the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • That the program stored in the memory 52 is configured to perform repeated block retrieval on the data blocks in the first data packet includes: the program is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n and the hash values of those data blocks into memory, where n is an integer greater than or equal to 0; and compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area numbered n, to complete repeated block retrieval for the data blocks in the first data packet.
  • The program stored in the memory 52 is further configured to: when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory.
  • In that case, that the program is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area numbered n, to complete repeated block retrieval for the data blocks in the first data packet, includes: the program is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage areas numbered n and (n+1), to complete repeated block retrieval for the data blocks in the first data packet.
  • The program stored in the memory 52 being configured to obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold includes: the program is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of bits identical to the bits at the corresponding positions of the hash value of the first data packet is greater than or equal to a preset number.
  • The program stored in the memory 52 being specifically configured to obtain such a hash value as the first hash value includes: the program is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
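The Hamming-distance selection can be illustrated with a short sketch. The names `hamming_distance` and `find_first_hash` are hypothetical, and returning the closest qualifying entry when several are within the threshold is an assumption; the text only requires that the distance be less than or equal to the preset threshold.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def find_first_hash(packet_hash, stored_hashes, max_distance):
    """Return a stored hash within max_distance of packet_hash, or None.

    Assumption: among several qualifying hashes, the closest one is chosen.
    """
    best_hash, best_distance = None, None
    for stored in stored_hashes:
        d = hamming_distance(packet_hash, stored)
        if d <= max_distance and (best_distance is None or d < best_distance):
            best_hash, best_distance = stored, d
    return best_hash
```

For fixed-width hashes the two formulations in the text are equivalent: the number of identical bits equals the hash width minus the Hamming distance, so a minimum count of identical bits corresponds to a maximum Hamming distance.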
  • The repeated data retrieval device provided by this embodiment of the present invention can be used to execute the process of the repeated data retrieval method shown in FIG. 1. Its specific working principle is not described again here; for details, refer to the description of the method embodiment.
  • The repeated data retrieval device provided in this embodiment first divides the received data into blocks and groups the blocks, performs a similarity hash operation on the data blocks in each data packet to obtain the hash value of the data packet, and then obtains, from the hash value storage table, a first hash value whose similarity to the hash value of the data packet is greater than or equal to the preset first similarity threshold. When it determines that the similarity between the hash value of the data packet and the first hash value is greater than or equal to the preset second similarity threshold, it performs repeated block retrieval on the data blocks in the data packet. Because the hash value storage table stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets, and the number of data packets is relatively small, querying the hash value storage table is efficient. Performing repeated block retrieval per data packet also reduces the number of repeated block retrievals, that is, it reduces the frequency of disk interaction, which helps improve the efficiency of duplicate block queries and thereby the overall performance of deduplication.
  • An embodiment of the present invention provides a computer program product comprising a computer-readable storage medium for storing a program. The program includes:
  • The block obtaining unit 81 is configured to perform block processing on the received data to obtain at least two data blocks.
  • The packet obtaining unit 82 is connected to the block obtaining unit 81 and is configured to group the at least two data blocks obtained by the block obtaining unit 81 to obtain at least one data packet, each data packet including at least one data block.
  • The hash calculation unit 83 is connected to the packet obtaining unit 82 and is configured to: for a first data packet in the at least one data packet obtained by the packet obtaining unit 82, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold. The hash value storage table stores a correspondence between the hash value of a second data packet that has already been stored in the data storage space and that second data packet; the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet. The first data packet is any one of the at least one data packet.
  • The repeated retrieval unit 84 is connected to the hash calculation unit 83 and is configured to perform repeated block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
  • The repeated data retrieval device of this embodiment further includes a storage unit 85.
  • The storage unit 85 is connected to the hash calculation unit 83 and is configured to: when the similarity between the hash value of the first data packet and the first hash value obtained by the hash calculation unit 83 is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • The hash calculation unit 83, the repeated retrieval unit 84, and the storage unit 85 perform the same operations for each data packet.
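The division of labour between the units above can be sketched as a two-threshold decision. This is an illustrative sketch under stated assumptions: similarity is taken as the fraction of identical bits, the hash value storage table is modelled as a dict from packet hash to a packet identifier, and a packet with no candidate at all is treated like one below the second threshold. The names `similarity` and `process_packet` are hypothetical.

```python
def similarity(a: int, b: int, bits: int) -> float:
    """Assumed similarity measure: fraction of identical bit positions."""
    return (bits - bin(a ^ b).count("1")) / bits


def process_packet(packet_id, packet_hash, table,
                   first_threshold, second_threshold, bits=64):
    """Decide between repeated block retrieval and storing a new packet.

    table: hypothetical hash value storage table {packet_hash: packet_id}.
    Returns ("retrieve", matched_packet_id) or ("store", packet_id).
    """
    # Hash calculation unit: candidates at or above the first threshold.
    candidates = [h for h in table
                  if similarity(packet_hash, h, bits) >= first_threshold]
    if candidates:
        first_hash = max(candidates,
                         key=lambda h: similarity(packet_hash, h, bits))
        # Repeated retrieval unit: only run block-level retrieval when the
        # second threshold is also met.
        if similarity(packet_hash, first_hash, bits) >= second_threshold:
            return ("retrieve", table[first_hash])
    # Storage unit: record the new packet in the hash value storage table.
    table[packet_hash] = packet_id
    return ("store", packet_id)
```

In this sketch the block-level comparison is only triggered for packets that look similar to something already stored, which is how the table lookup limits the number of disk-backed block retrievals.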
  • The packet obtaining unit 82 is specifically configured to: form hash data to be blocked from the hash values of the at least two data blocks obtained by the block obtaining unit 81; perform block processing on that hash data with a blocking algorithm, using the length of the hash value of one data block as the sliding step, to obtain at least one hash value block; and take the data blocks corresponding to the hash values in the same hash value block as one data packet, thereby obtaining at least one data packet.
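One way to picture this grouping step is a boundary rule applied to the sequence of block hashes, advancing one whole hash value at a time. The passage does not name the blocking algorithm, so the rule below (cut when the hash modulo `divisor` equals `divisor - 1`, or when `max_blocks` is reached) and the name `group_blocks` are purely assumptions for illustration.

```python
def group_blocks(block_hashes, divisor=4, max_blocks=8):
    """Group data blocks into packets by cutting the hash sequence.

    Returns a list of packets, each a list of indices into block_hashes.
    The sliding step is one hash value, so a cut never splits a hash.
    """
    packets, current = [], []
    for index, h in enumerate(block_hashes):
        current.append(index)
        # Assumed boundary rule; a size cap keeps packets bounded.
        if h % divisor == divisor - 1 or len(current) >= max_blocks:
            packets.append(current)
            current = []
    if current:
        packets.append(current)  # trailing blocks form the last packet
    return packets
```

Because the cut points depend only on the hash values, the same run of data blocks tends to be grouped the same way wherever it appears in the input, which is what makes the packet hashes comparable across data streams.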
  • The hash calculation unit 83 being configured to perform a similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: the hash calculation unit 83 is specifically configured to hash each data block in the first data packet to obtain a hash value of each data block; replace each 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
  • The data storage space includes a plurality of storage areas.
  • The hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • The repeated retrieval unit 84 is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value; load the data blocks in the storage area corresponding to the number n, together with the hash values of those data blocks, into the memory, where n is an integer greater than or equal to 0; and compare the data blocks in the first data packet with the data blocks having the same hash values in the storage area corresponding to the number n, to complete the repeated block retrieval on the data blocks in the first data packet.
  • The repeated retrieval unit 84 is further configured to: when loading the data blocks and the hash values of the data blocks in the storage area corresponding to the number n into the memory, also load the data blocks and the hash values of the data blocks in the storage area corresponding to the number (n+1) into the memory. Based on this, the repeated retrieval unit 84 being specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash values in the storage area corresponding to the number n includes: the repeated retrieval unit 84 is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash values in the storage areas corresponding to the number n and the number (n+1), to complete the repeated block retrieval on the data blocks in the first data packet.
  • The hash calculation unit 83 being configured to obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold includes: the hash calculation unit 83 is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of bits identical to the bits at the corresponding positions of the hash value of the first data packet is greater than or equal to a preset number.
  • More specifically, the hash calculation unit 83 is configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
  • The repeated data retrieval device provided by this embodiment of the present invention can be used to execute the process of the repeated data retrieval method shown in FIG. 1. Its specific working principle is not described again here; for details, refer to the description of the method embodiment.
  • The repeated data retrieval device provided in this embodiment first divides the received data into blocks and groups the blocks, performs a similarity hash operation on the data blocks in each data packet to obtain the hash value of the data packet, and then obtains, from the hash value storage table, a first hash value whose similarity to the hash value of the data packet is greater than or equal to the preset first similarity threshold. When it determines that the similarity between the hash value of the data packet and the first hash value is greater than or equal to the preset second similarity threshold, it performs repeated block retrieval on the data blocks in the data packet. Because the hash value storage table stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets, and the number of data packets is relatively small, querying the hash value storage table is efficient. Performing repeated block retrieval per data packet also reduces the number of repeated block retrievals, that is, it reduces the frequency of disk interaction, which helps improve the efficiency of duplicate block queries and thereby the overall performance of deduplication.
  • The aforementioned program can be stored in a computer-readable storage medium.
  • When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a repeated data retrieval method and device. The method comprises: dividing received data into blocks to obtain at least two data blocks; grouping the at least two data blocks to obtain at least one data packet; and, for each data packet, performing a similarity hash operation on the data blocks in the data packet to obtain a hash value of the data packet, obtaining from a hash value storage table a first hash value whose similarity to the hash value of the data packet is greater than or equal to a first similarity threshold, and, if the similarity between the hash value of the data packet and the first hash value is greater than or equal to a preset second similarity threshold, performing repeated block retrieval on the data blocks in the data packet. The technical solution of the present invention increases the efficiency of duplicate block lookup, which improves the overall performance of data deduplication.
PCT/CN2012/083740 2012-10-30 2012-10-30 Procédé et dispositif de récupération de données en double WO2014067063A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2012/083740 WO2014067063A1 (fr) 2012-10-30 2012-10-30 Procédé et dispositif de récupération de données en double
CN201280001989.7A CN103189867B (zh) 2012-10-30 2012-10-30 重复数据检索方法及设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/083740 WO2014067063A1 (fr) 2012-10-30 2012-10-30 Procédé et dispositif de récupération de données en double

Publications (1)

Publication Number Publication Date
WO2014067063A1 (fr) 2014-05-08

Family

ID=48679810

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/083740 WO2014067063A1 (fr) 2012-10-30 2012-10-30 Procédé et dispositif de récupération de données en double

Country Status (2)

Country Link
CN (1) CN103189867B (fr)
WO (1) WO2014067063A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202212A (zh) * 2016-06-28 2016-12-07 微梦创科网络科技(中国)有限公司 一种基于数据服务器集群实现数据拆分的方法及系统

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014067063A1 (fr) * 2012-10-30 2014-05-08 华为技术有限公司 Procédé et dispositif de récupération de données en double
CN104823184B (zh) * 2013-09-29 2016-11-09 华为技术有限公司 一种数据处理方法、系统及客户端
CN103858125B (zh) * 2013-12-17 2015-12-30 华为技术有限公司 重复数据处理方法、装置及存储控制器和存储节点
CN105843859B (zh) * 2016-03-17 2019-05-24 华为技术有限公司 数据处理的方法、装置和设备
CN106873964A (zh) * 2016-12-23 2017-06-20 浙江工业大学 一种改进的SimHash代码相似度检测方法
CN107644081A (zh) * 2017-09-21 2018-01-30 锐捷网络股份有限公司 数据去重方法及装置
CN110134544A (zh) * 2018-02-08 2019-08-16 广东亿迅科技有限公司 数据自动化备份的方法及其系统
CN108763270A (zh) * 2018-04-07 2018-11-06 长沙开雅电子科技有限公司 一种重复数据删除哈希表存储实现方法
CN108875062B (zh) * 2018-06-26 2021-07-23 北京奇艺世纪科技有限公司 一种重复视频的确定方法及装置
CN109670153B (zh) * 2018-12-21 2023-11-17 北京城市网邻信息技术有限公司 一种相似帖子的确定方法、装置、存储介质及终端
CN110909019B (zh) * 2019-11-14 2022-04-08 湖南赛吉智慧城市建设管理有限公司 大数据查重方法、装置、计算机设备及存储介质
CN111628909B (zh) * 2020-05-25 2021-08-20 上海德吾信息科技有限公司 一种用于无线通信的数据重复发送标记系统及方法
CN114064621B (zh) * 2021-10-28 2022-07-15 江苏未至科技股份有限公司 一种重复数据判断方法
CN114817230A (zh) * 2022-06-29 2022-07-29 深圳市乐易网络股份有限公司 一种数据流过滤方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882141A (zh) * 2009-05-08 2010-11-10 北京众志和达信息技术有限公司 一种实现重复数据数据删除的方法和系统
CN101887457A (zh) * 2010-07-02 2010-11-17 杭州电子科技大学 基于内容的复制图像检测方法
CN102467572A (zh) * 2010-11-17 2012-05-23 英业达股份有限公司 支持重复数据删除程序的数据区块查询方法
WO2012092212A2 (fr) * 2010-12-28 2012-07-05 Microsoft Corporation Utilisation du partitionnement d'index et de la réconciliation pour la déduplication de données
US20120233135A1 (en) * 2011-01-17 2012-09-13 Quantum Corporation Sampling based data de-duplication
CN103189867A (zh) * 2012-10-30 2013-07-03 华为技术有限公司 重复数据检索方法及设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7469241B2 (en) * 2004-11-30 2008-12-23 Oracle International Corporation Efficient data aggregation operations using hash tables
US9245007B2 (en) * 2009-07-29 2016-01-26 International Business Machines Corporation Dynamically detecting near-duplicate documents
CN102622365B (zh) * 2011-01-28 2015-04-29 北京百度网讯科技有限公司 一种网页重复的判断系统及其判断方法


Also Published As

Publication number Publication date
CN103189867B (zh) 2016-05-25
CN103189867A (zh) 2013-07-03

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 12887672; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 12887672; Country of ref document: EP; Kind code of ref document: A1