WO2016041384A1 - Data deduplication method and apparatus - Google Patents

Info

Publication number
WO2016041384A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
data block
time period
threshold
deduplication
Prior art date
Application number
PCT/CN2015/080906
Other languages
English (en)
French (fr)
Inventor
李育国
游俊
张宗全
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP19162092.1A (EP3564844B1)
Priority to EP17208407.1A (EP3361409B1)
Priority to EP15841499.5A (EP3153987B1)
Publication of WO2016041384A1
Priority to US15/403,318 (US10564880B2)
Priority to US16/738,401 (US11531482B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments based on file chunks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G06F11/1453 Management of the data involved in backup or backup restore using de-duplication of the data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/81 Threshold

Definitions

  • The embodiments of the present invention relate to computer technologies, and in particular, to a data deduplication method and apparatus.
  • In the primary storage field, the storage server performs deduplication and compression on all received data. Specifically, when an external device sends data to a logical address of the storage server (assumed to be address 0), the storage server first divides the data into blocks.
  • The storage server then calculates the fingerprint of each data block using a hash algorithm and sends the fingerprint to the fingerprint database for querying (the fingerprint database stores the fingerprints of the data blocks already held in the storage space) to determine whether the data block is a duplicate block. If it is, the storage server performs a deduplication operation: it deletes the duplicate block, increases by 1 the reference count of the identical data block already in the storage space (assume that block is at address 1, a physical address), and points address 0 to address 1.
  • If the data block is a unique block, the storage server saves it to the storage space, optionally compressing it first, and allocates a physical address to store it.
  • However, the external device may go on sending write requests to address 0 that overwrite the data there. In that case the deduplication operation already performed on the data at address 0 is meaningless and wastes the computing resources of the storage server.
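  • The always-deduplicate write path described above can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 stands in for the unspecified hash algorithm, zlib for the unspecified compression, and the class and attribute names (`InlineDedupStore`, `fingerprints`, `refcount`, `logical`) are hypothetical. Releasing the reference held by the previous content of an overwritten logical address is omitted for brevity.

```python
import hashlib
import zlib


class InlineDedupStore:
    """Toy model of a store that deduplicates every incoming block."""

    def __init__(self):
        self.fingerprints = {}  # fingerprint -> physical address
        self.blocks = {}        # physical address -> compressed block
        self.refcount = {}      # physical address -> number of references
        self.logical = {}       # logical address -> physical address
        self._next_physical = 1

    def write(self, logical_address, block):
        # Fingerprint the block (SHA-256 is an assumption, not from the source).
        fingerprint = hashlib.sha256(block).hexdigest()
        physical = self.fingerprints.get(fingerprint)
        if physical is None:
            # Unique block: compress it and allocate a new physical address.
            physical = self._next_physical
            self._next_physical += 1
            self.blocks[physical] = zlib.compress(block)
            self.fingerprints[fingerprint] = physical
            self.refcount[physical] = 0
        # Duplicate or unique, the logical address now references this block.
        self.refcount[physical] += 1
        self.logical[logical_address] = physical
        return physical

    def read(self, logical_address):
        return zlib.decompress(self.blocks[self.logical[logical_address]])
```

  • In this sketch every write pays for a hash computation and a fingerprint lookup, which is the cost that becomes wasted work when the same logical address is overwritten again shortly afterwards.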
  • The embodiments of the present invention provide a data deduplication method and apparatus to solve the technical problem that storage server resources are wasted when deduplication operations are performed on data blocks.
  • An embodiment of the present invention provides a data deduplication method, including:
  • The determining whether the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold specifically includes:
  • The first record table is used to record addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
  • When it is determined that the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold, the method further includes:
  • The number of overwrites of the first address in the [t1, t2] time period is increased by one.
  • The method further includes:
  • In a fourth possible implementation manner of the first aspect, if the number of overwrites of the first address in the (t2, t3) time period does not exceed a second threshold, the first address is deleted from the first record table, where t3 is a time point later than t2.
  • The method further includes:
  • The restoring of the data block as it was when the deduplication operation was last performed on the first address specifically includes:
  • An embodiment of the present invention provides a data deduplication apparatus, including:
  • a receiving module, configured to receive an overwrite request sent by an external device, where the overwrite request carries a data block and a first address at which the data block is to be stored;
  • a judging module, configured to determine whether the number of overwrites of the first address in the [t1, t2] time period exceeds a first threshold, where t1 and t2 are time points and t2 is later than t1;
  • a deduplication module, configured to: when the judging module determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, not perform a deduplication operation on the data block; and when the judging module determines that the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold, perform a deduplication operation on the data block.
  • The judging module is specifically configured to query whether the first address exists in the first record table, where the first record table records addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
  • The apparatus further includes:
  • a counting module, configured to: when the judging module determines that the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold, increase the number of overwrites of the first address in the [t1, t2] time period by one.
  • The apparatus further includes:
  • a recording module, configured to: when the judging module determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, record the first address in the first record table and point the first address to a second address in the lookup table, where the lookup table includes a mapping between the second address and a fingerprint of the data block.
  • The recording module is further configured to: when the number of overwrites of the first address in the (t2, t3) time period does not exceed a second threshold, delete the first address from the first record table, where t3 is a time point later than t2.
  • The receiving module is further configured to receive a read request sent by the external device, where the read request carries the first address.
  • The apparatus further includes:
  • a data recovery module, configured to: when the judging module determines that the number of times the first address is read in the (t2, t4) time period exceeds a third threshold, restore the data block as it was when the deduplication operation was last performed on the first address, where t4 is a time point later than t2.
  • The data recovery module includes:
  • a data reading unit, configured to read the data block at the second address;
  • a data recovery unit, configured to restore the data block at the second address to obtain the data block as it was when the deduplication operation was last performed on the first address;
  • a storage marking unit, configured to store the data block as it was when the deduplication operation was last performed on the first address at a third address, and mark the first address as not having undergone the deduplication operation.
  • An embodiment of the present invention provides a data deduplication apparatus, including a central processing unit and a memory that communicate with each other. The memory stores computer-executable instructions, and the central processing unit executes those instructions to perform the first aspect of the embodiments of the present invention or any of the first to sixth possible implementation manners of the first aspect.
  • In the data deduplication method and apparatus provided by the embodiments of the present invention, the storage server receives an overwrite request, sent by an external device, that carries a data block and a first address, and determines whether the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold. When it does, the storage server does not perform the deduplication operation on the data block for the first address, which saves a large amount of the storage server's computing resources and reduces the impact of deduplication operations on the performance of the data storage network.
  • FIG. 1 is a schematic flowchart of Embodiment 1 of a data deduplication method according to the present invention.
  • FIG. 2 is a network topology diagram of a storage system provided by the present invention.
  • FIG. 3 is a schematic flowchart of Embodiment 2 of a data deduplication method according to the present invention.
  • FIG. 4 is a schematic structural diagram of Embodiment 1 of a data deduplication apparatus according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of Embodiment 2 of a data deduplication apparatus according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of Embodiment 3 of a data deduplication apparatus according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of Embodiment 4 of a data deduplication apparatus according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of Embodiment 1 of a data deduplication method according to the present invention.
  • Data deduplication is hereinafter referred to as deduplication.
  • The method may be executed by a storage server, specifically by a deduplication module in the storage server. As shown in FIG. 1, the method includes:
  • S101: Receive an overwrite request sent by an external device, where the overwrite request carries a data block and a first address at which the data block is to be stored.
  • The embodiment of the present invention can be applied to the storage system network topology shown in FIG. 2.
  • The external device sends the overwrite request to the storage server through a data storage network, for example, a storage area network (SAN).
  • The overwrite request carries the data block and the first address at which the data block is to be stored.
  • The first address may be a logical address.
  • S102: Determine whether the number of overwrites of the first address in the [t1, t2] time period exceeds a first threshold, where t1 and t2 are time points and t2 is later than t1. If yes, do not perform a deduplication operation on the data block; if not, perform a deduplication operation on the data block.
  • The storage server determines whether the number of overwrites, in the [t1, t2] time period, of the first address carried in the overwrite request exceeds the first threshold.
  • The [t1, t2] time period can be set by corresponding software, for example, by a software timer.
  • The first threshold may be set by the user, or may be set by the storage server according to actual needs.
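  • One way to count overwrites per address inside a closed window [t1, t2] is sketched below. The source does not say how the count is kept, so keeping sorted timestamps per address is an assumption made for illustration, and `OverwriteWindowCounter` and its methods are hypothetical names.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class OverwriteWindowCounter:
    """Counts overwrites of each address inside a closed window [t1, t2]."""

    def __init__(self):
        # address -> sorted list of overwrite timestamps
        self._times = defaultdict(list)

    def record(self, address, timestamp):
        # insort keeps the list sorted even if timestamps arrive out of order.
        insort(self._times[address], timestamp)

    def count(self, address, t1, t2):
        times = self._times.get(address, [])
        # Number of timestamps t with t1 <= t <= t2.
        return bisect_right(times, t2) - bisect_left(times, t1)

    def exceeds(self, address, t1, t2, first_threshold):
        return self.count(address, t1, t2) > first_threshold
```

  • A production system would more likely keep a single counter per address that a timer resets at the window boundary; the timestamp list is used here only because it makes the window semantics explicit.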
  • If the storage server determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, it does not perform the deduplication operation on the data block but processes it according to the prior-art flow, for example, by writing the data block to the corresponding physical or logical address. (The storage server expects further data blocks to arrive at the first address within a short time, so it does not deduplicate this one.)
  • If the storage server determines that the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold, it performs a deduplication operation on the data block: if the data block is a duplicate data block, the duplicate is deleted; if the data block is a unique data block, it is saved, and its fingerprint and reference count are recorded.
  • When the data block is saved, it may be compressed and the compressed block stored at a physical or logical address, or it may be stored at a physical or logical address directly.
  • By determining whether the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, the storage server avoids deduplicating indiscriminately at every address. It deduplicates data blocks at addresses whose overwrite count does not exceed the first threshold and skips data blocks at addresses whose overwrite count exceeds it, which saves a large amount of the storage server's computing resources and reduces the impact of deduplication operations on data storage network performance.
  • In the data deduplication method provided by this embodiment, the storage server receives an overwrite request, sent by the external device, that carries a data block and a first address, and determines whether the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold. When it does, the data block for the first address is not deduplicated, which saves a large amount of the storage server's computing resources and reduces the impact of deduplication operations on data storage network performance.
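  • The S102 decision can be sketched as a single branch. The function name `handle_overwrite`, the `overwrite_counts` dictionary, and the two callbacks are hypothetical; they stand in for whatever counting and storage paths a real storage server uses.

```python
def handle_overwrite(first_address, block, overwrite_counts, first_threshold,
                     dedup_write, plain_write):
    """Route a block to the plain or the deduplicating write path (S102).

    overwrite_counts maps an address to its overwrite count in [t1, t2].
    dedup_write and plain_write are caller-supplied callables.
    """
    if overwrite_counts.get(first_address, 0) > first_threshold:
        # Frequently overwritten address: skip fingerprinting and lookup.
        plain_write(first_address, block)
        return "stored_without_dedup"
    # Rarely overwritten address: deduplication is likely to pay off.
    dedup_write(first_address, block)
    return "deduplicated"
```

  • For example, with `overwrite_counts = {0x10: 5}` and a first threshold of 3, a write to `0x10` takes the plain path, while a write to any address absent from the dictionary is deduplicated.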
  • The method in this embodiment concerns the process in which the storage server determines whether the first address exists in the first record table and thereby decides whether to deduplicate the data block for the first address.
  • The foregoing S102 specifically includes: querying whether the first address exists in a first record table, where the first record table records addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
  • The storage server queries whether the first address carried in the overwrite request exists in the first record table.
  • The first record table may include one or more addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold, that is, addresses with a higher probability of being overwritten. These addresses may all be logical addresses.
  • The addresses may be held in the first record table as a set of addresses or as a mapping between each address and its number of overwrites; the embodiment of the present invention does not limit the storage form of addresses in the first record table.
  • If the storage server determines that the first address exists in the first record table, it does not perform the deduplication operation on the data block but processes it according to the prior-art flow, for example, by writing the data block to the corresponding physical or logical address. (The storage server expects further data blocks to arrive at the first address within a short time, so it does not deduplicate this one.)
  • If the storage server determines that the first address does not exist in the first record table (that is, the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold), it performs a deduplication operation on the data block: if the data block is a duplicate data block, the duplicate is deleted; if the data block is a unique data block, it is saved, and its fingerprint and reference count are recorded.
  • When the data block is saved, it may be compressed and the compressed block stored at a physical or logical address, or it may be stored at a physical or logical address directly.
  • In the data deduplication method provided by this embodiment, the storage server receives an overwrite request, sent by the external device, that carries a data block and a first address, and queries whether the first address exists in the first record table. When the first address exists in the first record table, the data block for the first address is not deduplicated, which saves a large amount of the storage server's computing resources and reduces the impact of deduplication operations on the performance of the data storage network.
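  • The two storage forms mentioned for the first record table, a plain set of addresses and a mapping from address to overwrite count, can be sketched side by side. Both class names and the `should_deduplicate` helper are hypothetical; the point is only that the membership query behaves identically for either form.

```python
class SetRecordTable:
    """First record table kept as a bare set of hot addresses."""

    def __init__(self):
        self._hot = set()

    def add(self, address, overwrites=None):
        # The overwrite count is accepted but not stored in this form.
        self._hot.add(address)

    def __contains__(self, address):
        return address in self._hot


class MapRecordTable:
    """First record table kept as address -> overwrite count."""

    def __init__(self):
        self._hot = {}

    def add(self, address, overwrites):
        self._hot[address] = overwrites

    def __contains__(self, address):
        return address in self._hot

    def overwrites(self, address):
        return self._hot[address]


def should_deduplicate(first_address, record_table):
    # An address found in the table is skipped; any other address is deduplicated.
    return first_address not in record_table
```

  • The mapping form costs more memory per entry but keeps the count available, which may help when deciding later whether an address has cooled down.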
  • When the storage server determines that the first address does not exist in the first record table, that is, that the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold, the storage server increases that number by 1. When another overwrite occurs at the first address, the storage server again checks whether the first address exists in the first record table. If not, it both deduplicates the data block for the first address and increases the number of overwrites of the first address in the [t1, t2] time period by 1, and so on.
  • When the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, the storage server records the first address in the first record table and points the first address to a second address in the lookup table, where the lookup table includes a mapping between the second address and a fingerprint of the data block.
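  • The count-then-promote behaviour can be sketched as follows. The function name `on_overwrite` and its arguments are hypothetical, the record table is modelled as a plain set, and pointing the first address to a second address in the lookup table is left out of the sketch.

```python
def on_overwrite(first_address, counts, record_table, first_threshold):
    """Update the overwrite count and report whether to deduplicate.

    counts: dict mapping address -> overwrites seen in [t1, t2].
    record_table: set of addresses whose count has exceeded the threshold.
    Returns True if the incoming block should be deduplicated.
    """
    if first_address in record_table:
        # Already known to be frequently overwritten: skip deduplication.
        return False
    counts[first_address] = counts.get(first_address, 0) + 1
    if counts[first_address] > first_threshold:
        # The write that crosses the threshold is still deduplicated,
        # but the address is recorded so that later writes are not.
        record_table.add(first_address)
    return True
```

  • With a first threshold of 2, the first three overwrites of an address return `True`, the third one also records the address, and the fourth returns `False`.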
  • That is, the storage server performs a deduplication operation on the data block to be stored at the first address and, at the same time, increases the number of overwrites of the first address in the [t1, t2] time period by one.
  • When the storage server then receives an overwrite request that writes a data block to the first address, it still performs the deduplication operation on the data block to be stored at the first address, but at this point it also stores the first address in the first record table.
  • If the data block is a unique block, the storage server performs the deduplication operation on the data block and stores the compressed data block at the second address in the lookup table (the second address is a new address that the storage server allocates to the unique block in the lookup table). The storage server establishes a mapping between the unique block and the second address and points the first address to the second address. In this way, when the external device accesses the first address, it indirectly accesses the data block at the second address.
  • If the data block at the first address is a duplicate block, the storage server performs the deduplication operation on the data block and, according to the fingerprint of the duplicate block, searches the lookup table for the address that stores the identical block. Because the fingerprint of the duplicate block corresponds to the second address (that is, the data block stored at the second address is the same as the duplicate block), the storage server points the first address to the second address. In this way, when the external device accesses the first address, it also indirectly accesses the data block at the second address.
  • The storage server still monitors whether the number of overwrites at the first address during the (t2, t3) time period exceeds a second threshold. If not, the first address has a low probability of being overwritten during the (t2, t3) time period, or is not overwritten at all, and the storage server deletes the first address from the first record table.
  • The second threshold may be 0 or an integer greater than 0.
  • t3 is a time point later than t2.
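  • Removing cooled-down addresses from the first record table can be sketched as follows, assuming the overwrite counts for the (t2, t3) time period are already available in a dictionary. The function name and both argument names are hypothetical.

```python
def evict_cooled_addresses(record_table, overwrites_in_t2_t3, second_threshold=0):
    """Delete addresses whose (t2, t3) overwrite count is within the threshold.

    record_table: set of hot addresses, modified in place.
    overwrites_in_t2_t3: dict mapping address -> overwrites seen in (t2, t3).
    Returns the set of addresses that were removed.
    """
    cooled = {
        address for address in record_table
        if overwrites_in_t2_t3.get(address, 0) <= second_threshold
    }
    record_table.difference_update(cooled)
    return cooled
```

  • With the default second threshold of 0, only addresses that saw no overwrite at all during (t2, t3) are removed, so deduplication resumes for them on their next write.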
  • FIG. 3 is a schematic flowchart of Embodiment 2 of a data deduplication method according to an embodiment of the present invention.
  • The method of this embodiment concerns the specific process in which, after the first address is recorded in the first record table, the storage server determines that the number of times the external device reads the first address exceeds a certain threshold and restores the data block for the first address. As shown in FIG. 3, the method includes:
  • S201: Receive a read request sent by the external device, where the read request carries the first address.
  • On receiving the external device's read request for the data block at the first address, the storage server determines whether the number of times the first address is read in the (t2, t4) time period exceeds a third threshold, where t4 is a time point later than t2. If it does, the first address has a high probability of being read during the (t2, t4) time period, and each time the external device reads the first address the storage server has to access the second address indirectly. To reduce the access delay, the storage server therefore restores the data block as it was when the deduplication operation was last performed on the first address. Specifically, the storage server reads the data block at the second address (which is the same as the data block at the time the deduplication and compression operation was last performed on the first address) and restores it to obtain the data block as of the last deduplication operation on the first address.
  • Restoring here means decompressing the data block.
  • The third threshold may be set by the user, or may be set by the storage server according to actual needs.
  • S203: Store the data block as of the last deduplication and compression operation on the first address at a third address, and mark the first address as not having undergone the deduplication operation.
  • In the data deduplication method provided by this embodiment, after the storage server determines that the number of times the first address in the first record table is read in the (t2, t4) time period exceeds the third threshold, it restores the data block as of the last deduplication operation on the first address, which reduces the delay when the external device accesses the first address.
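  • The read-triggered restore can be sketched as follows. zlib stands in for the unspecified compression, and every name here (`restore_if_read_hot`, `pointers`, `lookup_blocks`, `direct_blocks`, `dedup_flags`) is a hypothetical stand-in for the storage server's own structures.

```python
import zlib


def restore_if_read_hot(first_address, read_count, third_threshold,
                        pointers, lookup_blocks, direct_blocks, dedup_flags,
                        third_address):
    """Restore a block for direct access once its address is read often enough.

    pointers: first address -> second address in the lookup table.
    lookup_blocks: second address -> compressed block.
    direct_blocks: third address -> restored (decompressed) block.
    dedup_flags: first address -> False once marked as not deduplicated.
    Returns the restored block, or None if the threshold is not exceeded.
    """
    if read_count <= third_threshold:
        return None
    # Read the block at the second address, then restore (decompress) it.
    second_address = pointers[first_address]
    block = zlib.decompress(lookup_blocks[second_address])
    # S203: store the restored block at a third address and mark the
    # first address as not having undergone deduplication.
    direct_blocks[third_address] = block
    dedup_flags[first_address] = False
    return block
```

  • After the restore, reads of the first address can be served from the third address without the extra lookup and decompression, which is the latency saving the embodiment describes.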
  • The aforementioned program can be stored in a computer-readable storage medium.
  • When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
  • FIG. 4 is a schematic structural diagram of Embodiment 1 of a data deduplication apparatus according to an embodiment of the present invention.
  • The apparatus can be a storage server in the storage system or can be integrated into the storage server.
  • The apparatus includes a receiving module 11, a judging module 12, and a deduplication module 13.
  • The receiving module 11 is configured to receive an overwrite request sent by the external device, where the overwrite request carries a data block and a first address at which the data block is to be stored. The judging module 12 is configured to determine whether the number of overwrites of the first address in the [t1, t2] time period exceeds a first threshold, where t1 and t2 are time points and t2 is later than t1.
  • The deduplication module 13 is configured to: when the judging module 12 determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, not perform a deduplication operation on the data block; and when the judging module 12 determines that the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold, perform a deduplication operation on the data block.
  • The deduplication apparatus provided by this embodiment can perform the foregoing deduplication method embodiment; its implementation principle and technical effects are similar and are not described again here.
  • The judging module 12 is specifically configured to query whether the first address exists in the first record table, where the first record table records addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
  • The deduplication apparatus provided by this embodiment can perform the foregoing deduplication method embodiment; its implementation principle and technical effects are similar and are not described again here.
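  • The division into modules 11, 12, and 13 might be mirrored in software roughly as follows. This is a structural sketch only: the class names follow the module names in the text, but their methods, the `OverwriteRequest` type, and the string results are hypothetical, and the actual storage and deduplication work is left out.

```python
from dataclasses import dataclass


@dataclass
class OverwriteRequest:
    first_address: int
    data_block: bytes


class ReceivingModule:
    """Module 11: accepts overwrite requests from the external device."""

    def receive(self, first_address, data_block):
        return OverwriteRequest(first_address, data_block)


class JudgingModule:
    """Module 12: answers the threshold question via the first record table."""

    def __init__(self, first_record_table):
        self.first_record_table = first_record_table

    def exceeds_first_threshold(self, first_address):
        return first_address in self.first_record_table


class DeduplicationModule:
    """Module 13: deduplicates only when module 12 says the address is not hot."""

    def __init__(self, judging_module):
        self.judging_module = judging_module

    def process(self, request):
        if self.judging_module.exceeds_first_threshold(request.first_address):
            return "stored without deduplication"
        return "deduplicated"
```

  • Keeping the threshold decision in its own module means the counting and recording modules of the later embodiments can change how the record table is maintained without touching the deduplication path.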
  • FIG. 5 is a schematic structural diagram of Embodiment 2 of a data deduplication apparatus according to an embodiment of the present invention.
  • the apparatus may further include: a counting module 14 configured to: when the determining module 12 determines that the first address is in the [t1, t2] time period When the number of internal overwrite writes does not exceed the first threshold, the first address is overwritten by the number of times of overwriting in the [t1, t2] time period; and the recording module 15 is configured to: when the determining module 12 determines the Overwrite write on the first address When the number exceeds the first threshold in the [t1, t2] time period, the first address is recorded in the first record table, and the first address is pointed to the second in the lookup table An address; wherein the lookup table includes a mapping relationship between the second address and a fingerprint of the data block.
  • the foregoing recording module 15 is further configured to: when the number of overwrites of the first address in the (t2, t3) time period does not exceed the second threshold, the first address is from the first record table Deleted; wherein the t3 is a time point greater than t2.
  • the deduplication device provided by the embodiment of the present invention may perform the foregoing embodiment of the deduplication method, and the implementation principle and technical effects thereof are similar, and details are not described herein again.
  • FIG. 6 is a schematic structural diagram of Embodiment 3 of a data deduplication apparatus according to an embodiment of the present invention.
  • On the basis of the embodiment shown in FIG. 5, the receiving module 11 is further configured to receive a read request sent by the external device, where the read request carries the first address. The apparatus may then further include a data recovery module 16, configured to restore the data block that the first address held when the deduplication operation was last performed on it, when the determining module 12 determines that the number of times the first address is read in the (t2, t4] time period exceeds a third threshold, where t4 is a time point later than t2.
  • Further, the data recovery module 16 may specifically include:
  • a data reading unit 161, configured to read the data block at the second address;
  • a data recovery unit 162, configured to restore the data block at the second address to obtain the data block that the first address held when the deduplication operation was last performed on it; and
  • a storage marking unit 163, configured to store that data block at a third address and mark the first address as not having undergone a deduplication operation.
  • The deduplication apparatus provided by this embodiment of the present invention can perform the foregoing deduplication method embodiments; its implementation principle and technical effects are similar and are not described again here.
  • FIG. 7 is a schematic structural diagram of Embodiment 4 of a data deduplication apparatus according to an embodiment of the present invention.
  • As shown in FIG. 7, the apparatus may include a central processing unit 20 and a memory 21 that communicate via a bus. The memory 21 stores computer-executable instructions, and the central processing unit 20 executes them to carry out the technical solutions shown in the method embodiments of the present invention. The implementation principle and technical effects are similar and are not described again here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

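A data deduplication method and apparatus. The method includes: receiving an overwrite request sent by an external device, where the overwrite request carries a data block and a first address at which the data block is to be stored (S101); determining whether the number of overwrites of the first address in a [t1, t2] time period exceeds a first threshold, where t1 and t2 are both time points and t2 is later than t1; if it does, not performing a deduplication operation on the data block; and if it does not, performing the deduplication operation on the data block (S102). This saves a large amount of the storage server's computing resources and also reduces the impact of deduplication operations on the performance of the data storage network.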

Description

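Data Deduplication Method and Apparatus
Technical Field
Embodiments of the present invention relate to computer technologies, and in particular, to a data deduplication method and apparatus.
Background
In the prior art, a storage server performs deduplication and compression on all the data it receives. Specifically, for data blocks that have already been stored in the primary storage field: when an external device sends data to a logical address of the storage server (say, address 0), the storage server first splits the data into blocks and then uses a corresponding hash algorithm to calculate the fingerprint of each data block. It passes the fingerprint to a fingerprint library for lookup (the fingerprint library stores the fingerprints of the data blocks already stored in the storage space) to determine whether the data block is a duplicate block. If it is, the storage server performs a deduplication operation: it deletes the duplicate block, increases by 1 the reference count of the identical data block already in the storage space (that block is at address 1, which is a physical address), and points address 0 to address 1. If the block is a unique block, the storage server saves it to the storage space, optionally compressing it first, and allocates a physical address to store it.
However, after the duplicate block at address 0 has been deleted, the external device may still keep sending write requests to that address to overwrite the data at address 0. The deduplication operation performed on the data at address 0 then becomes pointless, which wastes the storage server's computing resources.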
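The prior-art flow described in the background can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `BaselineDedupStore`, the choice of SHA-256 as the fingerprint hash, the in-memory dictionaries, and the release of the previous reference on overwrite are all hypothetical. The `hash_calls` counter makes the wasted work visible: every overwrite of the same logical address pays for a fingerprint computation whose result is soon replaced.

```python
import hashlib


class BaselineDedupStore:
    """Prior-art style store: every incoming block is fingerprinted and deduplicated."""

    def __init__(self):
        self.fingerprints = {}   # fingerprint -> physical address
        self.blocks = {}         # physical address -> block bytes
        self.refcount = {}       # physical address -> reference count
        self.mapping = {}        # logical address -> physical address
        self.next_physical = 0
        self.hash_calls = 0      # how many fingerprints were computed

    def write(self, logical, block):
        self.hash_calls += 1
        fingerprint = hashlib.sha256(block).hexdigest()

        # Assumption: an overwrite releases the reference held by the old mapping.
        old = self.mapping.get(logical)
        if old is not None:
            self.refcount[old] -= 1

        physical = self.fingerprints.get(fingerprint)
        if physical is None:
            # Unique block: allocate a physical address and store the block.
            physical = self.next_physical
            self.next_physical += 1
            self.fingerprints[fingerprint] = physical
            self.blocks[physical] = block
            self.refcount[physical] = 0

        # Duplicate or unique, the logical address now points at the stored block.
        self.refcount[physical] += 1
        self.mapping[logical] = physical
        return physical


store = BaselineDedupStore()
first = store.write(0, b"A")
second = store.write(1, b"A")        # duplicate: reuses the same physical block
for i in range(5):                   # address 0 is overwritten again and again
    store.write(0, bytes([i]))
```

In this run, five further overwrites of logical address 0 trigger five more fingerprint computations, which is the cost the method below tries to avoid for frequently overwritten addresses.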
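Summary
Embodiments of the present invention provide a data deduplication method and apparatus, to solve the technical problem in the prior art that performing deduplication operations on data blocks wastes storage server resources.
According to a first aspect, an embodiment of the present invention provides a data deduplication method, including:
receiving an overwrite request sent by an external device, where the overwrite request carries a data block and a first address at which the data block is to be stored;
determining whether the number of overwrites of the first address in a [t1, t2] time period exceeds a first threshold, where t1 and t2 are both time points and t2 is later than t1;
if it does, not performing a deduplication operation on the data block; and
if it does not, performing the deduplication operation on the data block.
With reference to the first aspect, in a first possible implementation of the first aspect, the determining whether the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold specifically includes:
querying whether the first address exists in a first record table, where the first record table is used to record addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, when it is determined that the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold, the method further includes:
increasing by 1 the number of overwrites of the first address in the [t1, t2] time period.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, when the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, the method further includes:
recording the first address in the first record table, and pointing the first address to a second address in a lookup table, where the lookup table includes a mapping relationship between the second address and the fingerprint of the data block.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, if the number of overwrites of the first address in a (t2, t3] time period does not exceed a second threshold, the first address is deleted from the first record table, where t3 is a time point later than t2.
With reference to the third possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the method further includes:
receiving a read request sent by the external device, where the read request carries the first address; and
if it is determined that the number of times the first address is read in a (t2, t4] time period exceeds a third threshold, restoring the data block that the first address held when the deduplication operation was last performed on it, where t4 is a time point later than t2.
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the restoring of the data block that the first address held when the deduplication operation was last performed on it specifically includes:
reading the data block at the second address;
restoring the data block at the second address to obtain the data block that the first address held when the deduplication operation was last performed on it; and
storing that data block at a third address, and marking the first address as not having undergone a deduplication operation.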
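According to a second aspect, an embodiment of the present invention provides a data deduplication apparatus, including:
a receiving module, configured to receive an overwrite request sent by an external device, where the overwrite request carries a data block and a first address at which the data block is to be stored;
a determining module, configured to determine whether the number of overwrites of the first address in a [t1, t2] time period exceeds a first threshold, where t1 and t2 are both time points and t2 is later than t1; and
a deduplication module, configured to: not perform a deduplication operation on the data block when the determining module determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold; and perform the deduplication operation on the data block when the determining module determines that this number does not exceed the first threshold.
With reference to the second aspect, in a first possible implementation of the second aspect, the determining module is specifically configured to query whether the first address exists in a first record table, where the first record table is used to record addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the apparatus further includes:
a counting module, configured to increase by 1 the number of overwrites of the first address in the [t1, t2] time period when the determining module determines that this number does not exceed the first threshold.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the apparatus further includes:
a recording module, configured to: when the determining module determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, record the first address in the first record table and point the first address to a second address in a lookup table, where the lookup table includes a mapping relationship between the second address and the fingerprint of the data block.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the recording module is further configured to delete the first address from the first record table when the number of overwrites of the first address in a (t2, t3] time period does not exceed a second threshold, where t3 is a time point later than t2.
With reference to the third possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the receiving module is further configured to receive a read request sent by the external device, where the read request carries the first address;
and the apparatus further includes:
a data recovery module, configured to restore the data block that the first address held when the deduplication operation was last performed on it, when the determining module determines that the number of times the first address is read in a (t2, t4] time period exceeds a third threshold, where t4 is a time point later than t2.
With reference to the fifth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the data recovery module specifically includes:
a data reading unit, configured to read the data block at the second address;
a data recovery unit, configured to restore the data block at the second address to obtain the data block that the first address held when the deduplication operation was last performed on it; and
a storage marking unit, configured to store that data block at a third address and mark the first address as not having undergone a deduplication operation.
According to a third aspect, an embodiment of the present invention provides a data deduplication apparatus, including a central processing unit and a memory, where the central processing unit and the memory communicate via a bus, the memory stores computer-executable instructions, and the central processing unit executes the computer-executable instructions to perform the first aspect or any one of the first to sixth possible implementations of the first aspect of the embodiments of the present invention.
In the data deduplication method and apparatus provided by the embodiments of the present invention, the storage server receives an overwrite request that is sent by an external device and carries a data block and a first address, determines whether the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, and, when it does, does not perform the deduplication operation on the data block of the first address. This saves a large amount of the storage server's computing resources and also reduces the impact of deduplication operations on the performance of the data storage network.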
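Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of Embodiment 1 of the data deduplication method provided by the present invention;
FIG. 2 is a network topology diagram of a storage system provided by the present invention;
FIG. 3 is a schematic flowchart of Embodiment 2 of the data deduplication method provided by the present invention;
FIG. 4 is a schematic structural diagram of Embodiment 1 of the data deduplication apparatus provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of Embodiment 2 of the data deduplication apparatus provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of Embodiment 3 of the data deduplication apparatus provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of Embodiment 4 of the data deduplication apparatus provided by an embodiment of the present invention.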
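Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 1 is a schematic flowchart of Embodiment 1 of the data deduplication method provided by the present invention. Data deduplication is hereinafter referred to as deduplication. The method may be executed by a storage server, specifically by a deduplication module in the storage server. As shown in FIG. 1, the method includes:
S101: Receive an overwrite request sent by an external device, where the overwrite request carries a data block and a first address at which the data block is to be stored.
Specifically, this embodiment of the present invention is applicable to the storage system network topology shown in FIG. 2. The external device sends the overwrite request to the storage server through a data storage network, for example, a storage area network (SAN). The overwrite request carries a data block and a first address at which the data block is to be stored. Optionally, the first address may be a logical address.
S102: Determine whether the number of overwrites of the first address in the [t1, t2] time period exceeds a first threshold, where t1 and t2 are both time points and t2 is later than t1; if it does, do not perform a deduplication operation on the data block; if it does not, perform the deduplication operation on the data block.
Specifically, after receiving the overwrite request, the storage server determines whether the number of overwrites of the first address carried in the request exceeds the first threshold in the [t1, t2] time period. Optionally, the [t1, t2] time period may be set by corresponding software, for example, by timer software. The first threshold may be set by a user, or may be a threshold that the storage server sets according to actual needs.
When the storage server determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, it does not deduplicate the data block but handles it according to the prior-art processing flow for data blocks, for example, by writing the data block to the corresponding physical or logical address (the storage server knows that data blocks will keep arriving at the first address within a short time, so it no longer deduplicates this data block).
When the storage server determines that the number of overwrites in the [t1, t2] time period does not exceed the first threshold, it performs the deduplication operation on the data block: if the data block is a duplicate block, the duplicate is deleted; if it is a unique block, the data block is saved and its fingerprint and reference count are recorded. Optionally, the data block may be compressed and the compressed block stored at a physical or logical address, or the data block may be stored at a physical or logical address directly.
Because the storage server checks whether the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, it no longer performs deduplication for every address. It deduplicates only data blocks at addresses whose overwrite count does not exceed the first threshold, and skips deduplication for data blocks at addresses whose overwrite count exceeds it. This saves a large amount of the storage server's computing resources and also reduces the impact of deduplication operations on the performance of the data storage network.
In the data deduplication method provided by this embodiment of the present invention, the storage server receives an overwrite request that is sent by an external device and carries a data block and a first address, determines whether the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, and, when it does, does not perform the deduplication operation on the data block of the first address. This saves a large amount of the storage server's computing resources and also reduces the impact of deduplication operations on the performance of the data storage network.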
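The decision in S102 can be sketched as a pure function. This is only an illustration under stated assumptions: the description does not say how overwrites are counted, so the example assumes a hypothetical list of overwrite timestamps per address, and the names `overwrites_in_window` and `handle_overwrite` are invented for the sketch.

```python
def overwrites_in_window(write_times, t1, t2):
    """Count the overwrites whose timestamp falls in the closed interval [t1, t2]."""
    return sum(1 for t in write_times if t1 <= t <= t2)


def handle_overwrite(write_times, t1, t2, first_threshold):
    """Return the path S102 would take for a new overwrite of this address."""
    if overwrites_in_window(write_times, t1, t2) > first_threshold:
        # Frequently overwritten address: skip fingerprinting and lookup.
        return "store_without_dedup"
    return "deduplicate"


hot = handle_overwrite([1, 2, 3, 4], t1=0, t2=10, first_threshold=3)
cold = handle_overwrite([1, 2, 3], t1=0, t2=10, first_threshold=3)
```

Note that the comparison is strict: with a first threshold of 3, an address that has seen exactly three overwrites in the window is still deduplicated.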
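On the basis of the foregoing embodiment, this embodiment concerns the process in which the storage server determines whether the first address exists in a first record table and thereby decides whether to deduplicate the data block of the first address. On the basis of the embodiment shown in FIG. 1, S102 further specifically includes: querying whether the first address exists in the first record table, where the first record table is used to record addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
Specifically, after receiving the overwrite request, the storage server queries whether the first address carried in the request exists in the first record table. The first record table may include one or more addresses, all of which are addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold, that is, addresses with a relatively high probability of being overwritten. These addresses may all be logical addresses. Optionally, the addresses may be kept in the first record table as a set of addresses, or as a mapping between each address and the number of overwrites at that address. This embodiment of the present invention does not limit the form in which addresses are stored in the first record table.
When the storage server determines that the first address exists in the first record table, it does not deduplicate the data block but handles it according to the prior-art processing flow for data blocks, for example, by writing the data block to the corresponding physical or logical address (the storage server knows that data blocks will keep arriving at the first address within a short time, so it no longer deduplicates this data block).
When the storage server determines that the first address does not exist in the first record table (that is, the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold), it performs the deduplication operation on the data block: if the data block is a duplicate block, the duplicate is deleted; if it is a unique block, the data block is saved and its fingerprint and reference count are recorded. Optionally, the data block may be compressed and the compressed block stored at a physical or logical address, or the data block may be stored at a physical or logical address directly.
In the data deduplication method provided by this embodiment of the present invention, the storage server receives an overwrite request that is sent by an external device and carries a data block and a first address, queries whether the first address exists in the first record table, and, when it does, does not perform the deduplication operation on the data block of the first address. This saves a large amount of the storage server's computing resources and also reduces the impact of deduplication operations on the performance of the data storage network.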
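The record-table lookup can be illustrated with either of the two storage forms the description allows. The following is a minimal sketch, assuming the table is held in memory as a Python set or dict; the function names and the sample addresses are hypothetical.

```python
def is_recorded(first_record_table, address):
    """Return True if the address is listed in the first record table."""
    # Membership works the same way for a set of addresses and for a
    # mapping from address to overwrite count.
    return address in first_record_table


def route_overwrite(first_record_table, address):
    """Choose the write path for an overwrite request."""
    if is_recorded(first_record_table, address):
        return "store_without_dedup"
    return "deduplicate"


as_set = {0x10, 0x20}                 # table kept as a set of addresses
as_mapping = {0x10: 12, 0x20: 15}     # table kept as address -> overwrite count
```

With this form, the per-request check is a single membership test, so the decision costs far less than computing a fingerprint.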
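On the basis of the foregoing embodiment, when the storage server determines that the first address does not exist in the first record table, that is, when it determines that the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold, the storage server increases by 1 the number of overwrites of the first address in the [t1, t2] time period. When another overwrite occurs at the first address, the storage server again checks whether the first address exists in the first record table. If it does not, the storage server not only deduplicates the data block of the first address but also increases the number of overwrites of the first address in the [t1, t2] time period by 1 again, and so on.
When the number of overwrites of the first address in the [t1, t2] time period accumulates to the point of exceeding the first threshold, the storage server records the first address in the first record table and points the first address to a second address in a lookup table, where the lookup table includes a mapping relationship between the second address and the fingerprint of the data block.
Specifically, to explain the technical solution of this embodiment more conveniently, a simple example is given here:
Assume that the first threshold is 10 and that the first address does not exist in the first record table (that is, the number of overwrites of the first address in the [t1, t2] time period has not exceeded the first threshold). The storage server then deduplicates the data block to be stored at the first address and, at the same time, increases by 1 the number of overwrites of the first address in the [t1, t2] time period. Assume that after this increment the number of overwrites of the first address in the [t1, t2] time period is 9. When the 10th overwrite occurs at the first address (that is, the storage server again receives an overwrite request that writes some data block to the first address), the storage server still deduplicates the data block to be stored at the first address, but at this point it also stores the first address in the first record table.
If the data block of the 10th overwrite at the first address is a unique block, then after performing the deduplication operation on it, the storage server stores the compressed data block at a second address in the lookup table (the second address is a new address that the storage server allocates in the lookup table for the unique block, and the storage server establishes a mapping between the unique block and the second address) and points the first address to the second address. In this way, when the external device accesses the first address, it indirectly accesses the data block at the second address.
If the data block of the 10th overwrite at the first address is a duplicate block, then after performing the deduplication operation on it, the storage server uses the fingerprint of the duplicate block at the first address to look up, in the lookup table, the address that stores the duplicate block. Because the fingerprint of the duplicate block corresponds to the second address in the lookup table (that is, the data block stored at the second address is identical to the duplicate block), the storage server points the first address to the second address. In this way, when the external device accesses the first address, it likewise indirectly accesses the data block at the second address.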
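The counting and recording steps of the worked example can be sketched as follows. Several things here are assumptions rather than the patent's implementation: the class name `HotAddressTracker`, SHA-256 as the fingerprint, integer second addresses, and the in-memory tables. The sketch follows the worked example, in which the address is recorded on the 10th overwrite when the first threshold is 10, so it compares with `>=`; the claims speak of the count exceeding the threshold. Where a non-deduplicated block is physically stored is not modeled.

```python
import hashlib


class HotAddressTracker:
    """Counts overwrites per address and records hot addresses."""

    def __init__(self, first_threshold):
        self.first_threshold = first_threshold
        self.counts = {}                 # first address -> overwrites in [t1, t2]
        self.first_record_table = set()  # addresses treated as frequently overwritten
        self.lookup_table = {}           # fingerprint -> second address
        self.pointers = {}               # first address -> second address
        self.next_second_address = 0
        self.dedup_runs = 0              # how many times deduplication ran

    def _deduplicate(self, address, block):
        self.dedup_runs += 1
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in self.lookup_table:
            # Unique block: allocate a new second address for it.
            self.lookup_table[fingerprint] = self.next_second_address
            self.next_second_address += 1
        # Unique or duplicate, the first address points at the second address.
        self.pointers[address] = self.lookup_table[fingerprint]

    def overwrite(self, address, block):
        if address in self.first_record_table:
            return "stored_without_dedup"
        self._deduplicate(address, block)
        self.counts[address] = self.counts.get(address, 0) + 1
        if self.counts[address] >= self.first_threshold:
            # As in the worked example: the 10th overwrite is still
            # deduplicated, and the address is recorded afterwards.
            self.first_record_table.add(address)
        return "deduplicated"


tracker = HotAddressTracker(first_threshold=10)
results = [tracker.overwrite(0xA0, bytes([i])) for i in range(12)]
```

In this run the first ten overwrites of address 0xA0 are deduplicated, the address is then recorded, and the eleventh and twelfth overwrites bypass fingerprinting entirely.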
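Further, after the first address has been recorded in the first record table, the storage server continues to monitor whether the number of overwrites at the first address in the (t2, t3] time period exceeds a second threshold. If it does not, overwrites at the first address in the (t2, t3] time period are infrequent or did not occur at all, and the storage server deletes the first address from the first record table. Optionally, the second threshold may be 0 or an integer greater than 0. Here t3 is a time point later than t2.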
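The removal of cooled-down addresses can be sketched as below. This is an illustration only: the function name `evict_cooled_addresses` and the per-address list of overwrite timestamps are assumptions, since the description does not say how the storage server tracks overwrites in the (t2, t3] time period.

```python
def evict_cooled_addresses(first_record_table, overwrite_times, t2, t3,
                           second_threshold=0):
    """Remove addresses whose overwrites in (t2, t3] do not exceed the second threshold."""
    for address in list(first_record_table):
        # Half-open interval: an overwrite exactly at t2 belongs to [t1, t2].
        recent = sum(1 for t in overwrite_times.get(address, ()) if t2 < t <= t3)
        if recent <= second_threshold:
            first_record_table.discard(address)
    return first_record_table


table = {0x01, 0x02}
times = {0x01: [11, 12, 13], 0x02: [5, 10]}
remaining = evict_cooled_addresses(table, times, t2=10, t3=20)
```

Address 0x02 was last overwritten at t2 itself, so it has no overwrites in (t2, t3] and is removed; later overwrites of it would be deduplicated again.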
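FIG. 3 is a schematic flowchart of Embodiment 2 of the data deduplication method provided by an embodiment of the present invention. This embodiment concerns the specific process in which, after the first address has been recorded in the first record table, the storage server determines that the number of times the external device reads the first address exceeds a certain threshold and restores the data block that the first address held when the deduplication operation was last performed on it. As shown in FIG. 3, the method includes:
S201: Receive a read request sent by the external device, where the read request carries the first address.
S202: If it is determined that the number of times the first address is read in the (t2, t4] time period exceeds a third threshold, restore the data block that the first address held when the deduplication operation was last performed on it, where t4 is a time point later than t2.
Specifically, when the storage server receives a read request from the external device for the data block at the first address, it determines whether the number of times the first address is read in the (t2, t4] time period exceeds the third threshold. If it does, the first address is very likely to be read in the (t2, t4] time period, and every time the external device reads the first address, the storage server has to access the second address indirectly, which introduces access latency. To reduce this latency, the storage server restores the data block that the first address held when the deduplication operation was last performed on it. Specifically, the storage server reads the data block at the second address (because the first address points to the second address, the data block at the second address is identical to the data block that the first address held when the deduplication and compression operation was last performed) and restores it to obtain the data block that the first address held when the deduplication operation was last performed on it. Optionally, if the data block was compressed when it was stored at the second address, restoring it here means decompressing it. Optionally, the third threshold may be set by a user, or may be a threshold that the storage server sets according to actual needs.
S203: Store, at a third address, the data block that the first address held when the deduplication and compression operation was last performed, and mark the first address as not having undergone a deduplication operation.
In the data deduplication method provided by this embodiment of the present invention, after determining that the number of times the first address in the first record table is read in the (t2, t4] time period exceeds the third threshold, the storage server restores the data block that the first address held when the deduplication operation was last performed on it, which reduces the latency experienced by the external device when it accesses the first address.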
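Steps S201 to S203 can be sketched as follows. The sketch is hypothetical throughout: the class name `ReadRestorer`, the use of `zlib` for the optional compression, the integer third addresses, and the `deduplicated` set used to represent the "has undergone deduplication" mark are all assumptions. Reads are assumed to arrive inside the (t2, t4] time period, so no timestamps are modeled.

```python
import zlib


class ReadRestorer:
    """Restores a deduplicated block to its own location once it is read often."""

    def __init__(self, third_threshold):
        self.third_threshold = third_threshold
        self.pointers = {}        # first address -> second address
        self.second_store = {}    # second address -> compressed block
        self.third_store = {}     # third address -> restored block
        self.direct = {}          # first address -> third address
        self.deduplicated = set() # first addresses marked as deduplicated
        self.read_counts = {}     # first address -> reads counted in (t2, t4]
        self.next_third = 0

    def read(self, first_address):
        if first_address not in self.deduplicated:
            # Already restored: served directly, no indirection.
            return self.third_store[self.direct[first_address]]

        count = self.read_counts.get(first_address, 0) + 1
        self.read_counts[first_address] = count

        # Indirect access through the second address; "restore" here
        # means decompressing the stored block.
        second = self.pointers[first_address]
        block = zlib.decompress(self.second_store[second])

        if count > self.third_threshold:
            # S203: store the restored block at a third address and mark
            # the first address as not having undergone deduplication.
            third = self.next_third
            self.next_third += 1
            self.third_store[third] = block
            self.direct[first_address] = third
            self.deduplicated.discard(first_address)
        return block


restorer = ReadRestorer(third_threshold=2)
restorer.second_store[7] = zlib.compress(b"payload")
restorer.pointers[0xA0] = 7
restorer.deduplicated.add(0xA0)
reads = [restorer.read(0xA0) for _ in range(4)]
```

The third read pushes the count past the threshold and triggers the restore, so the fourth read is served from the third address without decompression.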
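A person of ordinary skill in the art can understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
FIG. 4 is a schematic structural diagram of Embodiment 1 of the data deduplication apparatus provided by an embodiment of the present invention. The apparatus may be a storage server in a storage system, or may be integrated into a storage server. As shown in FIG. 4, the apparatus includes a receiving module 11, a determining module 12, and a deduplication module 13. The receiving module 11 is configured to receive an overwrite request sent by an external device, where the overwrite request carries a data block and a first address at which the data block is to be stored. The determining module 12 is configured to determine whether the number of overwrites of the first address in the [t1, t2] time period exceeds a first threshold, where t1 and t2 are both time points and t2 is later than t1. The deduplication module 13 is configured to: not perform a deduplication operation on the data block when the determining module 12 determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold; and perform the deduplication operation on the data block when the determining module 12 determines that this number does not exceed the first threshold.
The data deduplication apparatus provided by this embodiment of the present invention can perform the foregoing data deduplication method embodiments; its implementation principle and technical effects are similar and are not described again here.
Further, the determining module 12 is specifically configured to query whether the first address exists in the first record table, where the first record table is used to record addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
The data deduplication apparatus provided by this embodiment of the present invention can perform the foregoing data deduplication method embodiments; its implementation principle and technical effects are similar and are not described again here.
FIG. 5 is a schematic structural diagram of Embodiment 2 of the data deduplication apparatus provided by an embodiment of the present invention. On the basis of the embodiment shown in FIG. 4, the apparatus may further include: a counting module 14, configured to increase by 1 the number of overwrites of the first address in the [t1, t2] time period when the determining module 12 determines that this number does not exceed the first threshold; and a recording module 15, configured to record the first address in the first record table and point the first address to a second address in a lookup table when the determining module 12 determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, where the lookup table includes a mapping relationship between the second address and the fingerprint of the data block.
Further, the recording module 15 is also configured to delete the first address from the first record table when the number of overwrites of the first address in the (t2, t3] time period does not exceed a second threshold, where t3 is a time point later than t2.
The data deduplication apparatus provided by this embodiment of the present invention can perform the foregoing data deduplication method embodiments; its implementation principle and technical effects are similar and are not described again here.
FIG. 6 is a schematic structural diagram of Embodiment 3 of the data deduplication apparatus provided by an embodiment of the present invention. On the basis of the embodiment shown in FIG. 5, the receiving module 11 is further configured to receive a read request sent by the external device, where the read request carries the first address. The apparatus may then further include a data recovery module 16, configured to restore the data block that the first address held when the deduplication operation was last performed on it, when the determining module 12 determines that the number of times the first address is read in the (t2, t4] time period exceeds a third threshold, where t4 is a time point later than t2.
Further, the data recovery module 16 may specifically include: a data reading unit 161, configured to read the data block at the second address; a data recovery unit 162, configured to restore the data block at the second address to obtain the data block that the first address held when the deduplication operation was last performed on it; and a storage marking unit 163, configured to store that data block at a third address and mark the first address as not having undergone a deduplication operation.
The data deduplication apparatus provided by this embodiment of the present invention can perform the foregoing data deduplication method embodiments; its implementation principle and technical effects are similar and are not described again here.
FIG. 7 is a schematic structural diagram of Embodiment 4 of the data deduplication apparatus provided by an embodiment of the present invention. As shown in FIG. 7, the apparatus may include a central processing unit 20 and a memory 21 that communicate via a bus. The memory 21 stores computer-executable instructions, and the central processing unit 20 executes them to carry out the technical solutions shown in the method embodiments of the present invention. The implementation principle and technical effects are similar and are not described again here.
Finally, it should be noted that the foregoing embodiments are merely intended to describe the technical solutions of the present invention, not to limit them. Although the present invention is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some or all of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.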

Claims (15)

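  1. A data deduplication method, characterized by including:
    receiving an overwrite request sent by an external device, where the overwrite request carries a data block and a first address at which the data block is to be stored;
    determining whether the number of overwrites of the first address in a [t1, t2] time period exceeds a first threshold, where t1 and t2 are both time points and t2 is later than t1;
    if it does, not performing a deduplication operation on the data block; and
    if it does not, performing the deduplication operation on the data block.
  2. The method according to claim 1, characterized in that the determining whether the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold specifically includes:
    querying whether the first address exists in a first record table, where the first record table is used to record addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
  3. The method according to claim 1 or 2, characterized in that, when it is determined that the number of overwrites of the first address in the [t1, t2] time period does not exceed the first threshold, the method further includes:
    increasing by 1 the number of overwrites of the first address in the [t1, t2] time period.
  4. The method according to claim 3, characterized in that, when the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, the method further includes:
    recording the first address in the first record table, and pointing the first address to a second address in a lookup table, where the lookup table includes a mapping relationship between the second address and the fingerprint of the data block.
  5. The method according to claim 4, characterized in that, if the number of overwrites of the first address in a (t2, t3] time period does not exceed a second threshold, the first address is deleted from the first record table, where t3 is a time point later than t2.
  6. The method according to claim 4, characterized in that the method further includes:
    receiving a read request sent by the external device, where the read request carries the first address; and
    if it is determined that the number of times the first address is read in a (t2, t4] time period exceeds a third threshold, restoring the data block that the first address held when the deduplication operation was last performed on it, where t4 is a time point later than t2.
  7. The method according to claim 6, characterized in that the restoring of the data block that the first address held when the deduplication operation was last performed on it specifically includes:
    reading the data block at the second address;
    restoring the data block at the second address to obtain the data block that the first address held when the deduplication operation was last performed on it; and
    storing that data block at a third address, and marking the first address as not having undergone a deduplication operation.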
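  8. A data deduplication apparatus, characterized by including:
    a receiving module, configured to receive an overwrite request sent by an external device, where the overwrite request carries a data block and a first address at which the data block is to be stored;
    a determining module, configured to determine whether the number of overwrites of the first address in a [t1, t2] time period exceeds a first threshold, where t1 and t2 are both time points and t2 is later than t1; and
    a deduplication module, configured to: not perform a deduplication operation on the data block when the determining module determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold; and perform the deduplication operation on the data block when the determining module determines that this number does not exceed the first threshold.
  9. The apparatus according to claim 8, characterized in that the determining module is specifically configured to query whether the first address exists in a first record table, where the first record table is used to record addresses whose number of overwrites in the [t1, t2] time period exceeds the first threshold.
  10. The apparatus according to claim 8 or 9, characterized in that the apparatus further includes:
    a counting module, configured to increase by 1 the number of overwrites of the first address in the [t1, t2] time period when the determining module determines that this number does not exceed the first threshold.
  11. The apparatus according to claim 10, characterized in that the apparatus further includes:
    a recording module, configured to: when the determining module determines that the number of overwrites of the first address in the [t1, t2] time period exceeds the first threshold, record the first address in the first record table and point the first address to a second address in a lookup table, where the lookup table includes a mapping relationship between the second address and the fingerprint of the data block.
  12. The apparatus according to claim 11, characterized in that the recording module is further configured to delete the first address from the first record table when the number of overwrites of the first address in a (t2, t3] time period does not exceed a second threshold, where t3 is a time point later than t2.
  13. The apparatus according to claim 11, characterized in that the receiving module is further configured to receive a read request sent by the external device, where the read request carries the first address;
    and the apparatus further includes:
    a data recovery module, configured to restore the data block that the first address held when the deduplication operation was last performed on it, when the determining module determines that the number of times the first address is read in a (t2, t4] time period exceeds a third threshold, where t4 is a time point later than t2.
  14. The apparatus according to claim 13, characterized in that the data recovery module specifically includes:
    a data reading unit, configured to read the data block at the second address;
    a data recovery unit, configured to restore the data block at the second address to obtain the data block that the first address held when the deduplication operation was last performed on it; and
    a storage marking unit, configured to store that data block at a third address and mark the first address as not having undergone a deduplication operation.
  15. A data deduplication apparatus, characterized by including a central processing unit and a memory, where the central processing unit and the memory communicate via a bus, the memory stores computer-executable instructions, and the central processing unit executes the computer-executable instructions to perform the method according to any one of claims 1 to 7.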
PCT/CN2015/080906 2014-09-17 2015-06-05 Data deduplication method and apparatus WO2016041384A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP19162092.1A EP3564844B1 (en) 2014-09-17 2015-06-05 Data deduplication method and apparatus
EP17208407.1A EP3361409B1 (en) 2014-09-17 2015-06-05 Data deduplication method and apparatus
EP15841499.5A EP3153987B1 (en) 2014-09-17 2015-06-05 Duplicate data deletion method and device
US15/403,318 US10564880B2 (en) 2014-09-17 2017-01-11 Data deduplication method and apparatus
US16/738,401 US11531482B2 (en) 2014-09-17 2020-01-09 Data deduplication method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410475287.3A CN104239518B (zh) 2014-09-17 2014-09-17 Data deduplication method and apparatus
CN201410475287.3 2014-09-17

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/403,318 Continuation US10564880B2 (en) 2014-09-17 2017-01-11 Data deduplication method and apparatus

Publications (1)

Publication Number Publication Date
WO2016041384A1 true WO2016041384A1 (zh) 2016-03-24

Family

ID=52227577

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/080906 WO2016041384A1 (zh) 2014-09-17 2015-06-05 Data deduplication method and apparatus

Country Status (4)

Country Link
US (2) US10564880B2 (zh)
EP (3) EP3153987B1 (zh)
CN (1) CN104239518B (zh)
WO (1) WO2016041384A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239518B (zh) 2014-09-17 2017-09-29 华为技术有限公司 Data deduplication method and apparatus
US9805099B2 (en) 2014-10-30 2017-10-31 The Johns Hopkins University Apparatus and method for efficient identification of code similarity
US10656991B2 (en) * 2015-08-24 2020-05-19 International Business Machines Corporation Electronic component having redundant product data stored externally
SG11201707075SA (en) 2015-12-29 2017-09-28 Huawei Tech Co Ltd Deduplication method and storage device
CN105787037B (zh) * 2016-02-25 2019-03-15 浪潮(北京)电子信息产业有限公司 Method and apparatus for deleting duplicate data
US10788988B1 (en) 2016-05-24 2020-09-29 Violin Systems Llc Controlling block duplicates
CN106095332A (zh) * 2016-06-01 2016-11-09 杭州宏杉科技有限公司 Data deduplication method and apparatus
CN111427855B (zh) * 2016-09-28 2024-04-12 华为技术有限公司 Data deduplication method in a storage system, storage system, and controller
CN107798047B (zh) * 2017-07-26 2021-03-02 深圳壹账通智能科技有限公司 Duplicate work order detection method, apparatus, server, and medium
CN107632786B (zh) * 2017-09-20 2020-04-07 杭州宏杉科技股份有限公司 Data deduplication management method and apparatus
CN108121504B (zh) * 2017-11-16 2021-01-29 成都华为技术有限公司 Data deletion method and apparatus
EP3867739A1 (en) * 2019-07-23 2021-08-25 Huawei Technologies Co., Ltd. Devices, system and methods for deduplication
US11269532B2 (en) * 2019-10-30 2022-03-08 EMC IP Holding Company LLC Data reduction by replacement of repeating pattern with single instance
CN111522843B (zh) * 2020-06-01 2023-06-27 北京创鑫旅程网络技术有限公司 Control method, system, device, and storage medium for a data platform
CN113709510A (zh) * 2021-08-06 2021-11-26 联想(北京)有限公司 Method, apparatus, device, and storage medium for real-time transmission of high-rate data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090271454A1 (en) * 2008-04-29 2009-10-29 International Business Machines Corporation Enhanced method and system for assuring integrity of deduplicated data
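CN101833496A (zh) * 2010-03-25 2010-09-15 北京邮电大学 Hard-disk-based apparatus and method for testing a host's protection against object reuse
CN102968597A (zh) * 2012-11-05 2013-03-13 中国电力科学研究院 File shredding method based on disk data link chains
CN103294957A (zh) * 2013-05-06 2013-09-11 北京赛思信安技术有限公司 Data protection method for data updates in a file system supporting data deduplication
CN104239518A (zh) * 2014-09-17 2014-12-24 华为技术有限公司 Data deduplication method and apparatus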

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8108638B2 (en) * 2009-02-06 2012-01-31 International Business Machines Corporation Backup of deduplicated data
US20110055471A1 (en) * 2009-08-28 2011-03-03 Jonathan Thatcher Apparatus, system, and method for improved data deduplication
US8904120B1 (en) * 2010-12-15 2014-12-02 Netapp Inc. Segmented fingerprint datastore and scaling a fingerprint datastore in de-duplication environments
WO2012129191A2 (en) * 2011-03-18 2012-09-27 Fusion-Io, Inc. Logical interfaces for contextual storage
US8462781B2 (en) * 2011-04-06 2013-06-11 Anue Systems, Inc. Systems and methods for in-line removal of duplicate network packets
US8930307B2 (en) * 2011-09-30 2015-01-06 Pure Storage, Inc. Method for removing duplicate data from a storage array
CA2890516C (en) * 2011-11-07 2018-11-27 Nexgen Storage, Inc. Primary data storage system with quality of service
US10248582B2 (en) * 2011-11-07 2019-04-02 Nexgen Storage, Inc. Primary data storage system with deduplication
US10216651B2 (en) * 2011-11-07 2019-02-26 Nexgen Storage, Inc. Primary data storage system with data tiering
US9348538B2 (en) * 2012-10-18 2016-05-24 Netapp, Inc. Selective deduplication
BR112014009477B1 (pt) * 2012-12-12 2018-10-16 Huawei Tech Co Ltd método e aparelho de processamento de dados em um sistema de agrupamento
WO2014101130A1 (zh) * 2012-12-28 2014-07-03 华为技术有限公司 数据处理方法及装置
US9633033B2 (en) * 2013-01-11 2017-04-25 Commvault Systems, Inc. High availability distributed deduplicated storage system
US9423978B2 (en) * 2013-05-08 2016-08-23 Nexgen Storage, Inc. Journal management
US20150006475A1 (en) * 2013-06-26 2015-01-01 Katherine H. Guo Data deduplication in a file system
US9773007B1 (en) * 2014-12-01 2017-09-26 Pure Storage, Inc. Performance improvements in a storage system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3153987A4 *

Also Published As

Publication number Publication date
EP3153987B1 (en) 2018-03-21
CN104239518B (zh) 2017-09-29
EP3153987A1 (en) 2017-04-12
EP3361409B1 (en) 2019-11-20
EP3153987A4 (en) 2017-05-31
EP3564844A1 (en) 2019-11-06
US20170123712A1 (en) 2017-05-04
CN104239518A (zh) 2014-12-24
US20200150890A1 (en) 2020-05-14
US10564880B2 (en) 2020-02-18
EP3564844B1 (en) 2020-08-26
EP3361409A1 (en) 2018-08-15
US11531482B2 (en) 2022-12-20

Similar Documents

Publication Publication Date Title
WO2016041384A1 (zh) Data deduplication method and apparatus
US11921684B2 (en) Systems and methods for database management using append-only storage devices
US11803567B1 (en) Restoration of a dataset from a cloud
US8751763B1 (en) Low-overhead deduplication within a block-based data storage
US9436720B2 (en) Safety for volume operations
WO2016086819A1 (zh) Method and apparatus for writing data to a shingled magnetic recording (SMR) hard disk
US10114576B2 (en) Storage device metadata synchronization
JP2017079053A (ja) Method and system for improving storage journaling
CN107135662B (zh) Differential data backup method, storage system, and differential data backup apparatus
US11360682B1 (en) Identifying duplicative write data in a storage system
WO2017157158A1 (zh) Data writing method and apparatus, and computer storage medium
JP7376488B2 (ja) Deduplication as an infrastructure for avoiding snapshot copy-on-write data movement
US8966207B1 (en) Virtual defragmentation of a storage
WO2018119998A1 (zh) Snapshot rollback method, apparatus, storage controller, and system
CN112817962B (zh) Object-storage-based data storage method, apparatus, and computer device
WO2018094958A1 (zh) Data processing method, apparatus, and system
US20190188102A1 (en) Method and system for data recovery in a cloud based computing environment utilizing object storage
US10664442B1 (en) Method and system for data consistency verification in a storage system
US11847334B2 (en) Method or apparatus to integrate physical file verification and garbage collection (GC) by tracking special segments
KR101608623B1 (ko) Memory recovery apparatus and method for effective data recovery after power loss
US11748259B2 (en) System and method to conserve device lifetime for snapshot generation
KR20150118207A (ko) Memory control apparatus and operating method thereof
KR20170002279A (ko) Memory recovery apparatus and method for effective data recovery after power loss

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15841499

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2015841499

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2015841499

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE