WO2021027541A1 - Method and device for deleting duplicate data - Google Patents

Method and device for deleting duplicate data

Info

Publication number
WO2021027541A1
Authority
WO
WIPO (PCT)
Prior art keywords
fingerprint
record
fingerprint record
data
item
Prior art date
Application number
PCT/CN2020/104846
Other languages
English (en)
French (fr)
Inventor
任仁
王晨
郭平静
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP20852903.2A (published as EP4016276A4)
Publication of WO2021027541A1
Priority to US17/671,224 (published as US20220164316A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604 Improving or facilitating administration, e.g. storage management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices

Definitions

  • This application relates to the field of storage technology, and in particular to a method and device for deleting duplicate data.
  • Data deduplication technology has been proposed for this purpose: if a piece of data is stored as multiple copies in the storage system, the redundant copies are deleted and only one copy is kept, thereby reducing the storage space occupied by the data.
  • One data deduplication technology is implemented as follows. First, the fingerprint of each piece of data is calculated, the data is stored, and the mapping between the fingerprint and the storage address of the data is recorded. The stored data is then treated as data to be deduplicated and is deduplicated in batches.
  • Performing batch deduplication on the stored data includes querying the fingerprint table for the fingerprint of each piece of stored data. If the same fingerprint is found, the data is determined to be duplicate data; otherwise it is considered unique data. The previously recorded mapping between the fingerprint of the data and its storage address is then deleted. It can be seen that the current technology has to look up every fingerprint of the data to be deduplicated in the fingerprint table before it can determine whether the data is duplicate data, which results in low deduplication efficiency.
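  • For illustration only, the prior-art flow described above can be sketched as follows. The function name `batch_dedupe_prior_art`, the in-memory `dict` standing in for the fingerprint table, and the step that adds unseen fingerprints to the table are assumptions of this sketch and are not taken from the application. The point it shows is that every pending item costs one fingerprint-table lookup.

```python
def batch_dedupe_prior_art(pending, fingerprint_table):
    """Classify stored data as duplicate or unique, one table lookup per item.

    pending: list of (fingerprint, storage_address) recorded when data was stored.
    fingerprint_table: dict mapping fingerprint -> address of the unique copy
                       (hypothetical in-memory stand-in for the fingerprint table).
    Returns the storage addresses that hold duplicate data.
    """
    duplicate_addresses = []
    for fp, addr in pending:
        if fp in fingerprint_table:
            # Same fingerprint already known: the data at addr is a duplicate.
            duplicate_addresses.append(addr)
        else:
            # Assumption: unique data is registered so later copies can be found.
            fingerprint_table[fp] = addr
    # The recorded mappings have been consumed and can now be dropped.
    pending.clear()
    return duplicate_addresses
```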
  • the present application provides a method and device for deleting duplicate data to improve the efficiency of the duplicate data deletion technology.
  • a method for deleting duplicate data is provided.
  • A fingerprint record containing multiple fingerprint record items is first obtained, wherein each fingerprint record item contains a fingerprint and the storage address of the data corresponding to the fingerprint. If two pieces of data are the same but are stored at different storage addresses, a separate fingerprint record item is generated for each of them.
  • The two fingerprint record items include the same fingerprint, but the storage addresses corresponding to the fingerprint are different.
  • At least two first fingerprint record items including the same fingerprint are then determined from the fingerprint record. For example, the at least two first fingerprint record items include the first fingerprint, so a deduplication operation is performed on the data corresponding to the first fingerprint in the at least two first fingerprint record items, the at least two first fingerprint record items are deleted, and a stub of the first fingerprint is recorded in the fingerprint record. The stub of the first fingerprint indicates that the first fingerprint is a duplicate fingerprint.
  • Since the stub corresponding to the duplicate fingerprint is added to the fingerprint record, the stub can be used to directly determine that the fingerprint included in a fingerprint record item is a duplicate fingerprint, instead of looking up every fingerprint of the data to be deduplicated in the fingerprint table as in the prior art. This application can therefore determine duplicate fingerprints quickly and perform deduplication operations on the corresponding data, which improves the efficiency of the deduplication technology.
  • When new data is written to the storage system, the fingerprint record item corresponding to the new data is recorded in the fingerprint record.
  • The fingerprint record item corresponding to the new data is referred to as the second fingerprint record item; it includes the first fingerprint and the storage address of the new data. Since the stub of the first fingerprint indicates that the first fingerprint is a duplicate fingerprint, and the second fingerprint record item includes the first fingerprint, the first fingerprint in the second fingerprint record item is determined to be a duplicate fingerprint, and the new data is deduplicated.
  • The fingerprint corresponding to the new data can be compared with the stub. If the fingerprint corresponding to the new data is the same as the fingerprint indicated by the stub, the new data can be deduplicated directly, so that the data is deduplicated without querying the fingerprint table. This saves the fingerprint-table query and improves the efficiency of the deduplication technology.
  • After the deduplication operation has been performed on the newly written data corresponding to the second fingerprint record item, the second fingerprint record item may be deleted.
  • deleting invalid fingerprint records can reduce the storage space occupied by fingerprint records, and can improve the utilization of storage space.
  • The third fingerprint record item in the fingerprint record can be deleted.
  • The fingerprint included in the third fingerprint record item is different from the fingerprints included in all other fingerprint record items in the fingerprint record. In other words, the unpaired (singleton) fingerprint record items in the fingerprint record are deleted.
  • As data is written, the storage space occupied by the fingerprint record grows, and it takes a certain period of time for that space to become greater than or equal to the first threshold. If a fingerprint record item remains unpaired throughout this period, the probability that the data corresponding to its fingerprint is stored repeatedly is very small, and the item would have to wait even longer before a deduplication operation could be performed. The item can therefore be deleted directly to save the storage space occupied by the fingerprint record.
  • When the storage space occupied by the fingerprint record is greater than or equal to the first threshold, the fourth fingerprint record item can be deleted, where the fourth fingerprint record item has been stored in the fingerprint record for a duration greater than or equal to the second threshold. In other words, the fingerprint record items written earlier into the fingerprint record are deleted.
  • If data has been overwritten, it is no longer stored repeatedly in the storage system, and there is no need to perform a data deduplication operation on it.
  • The earlier a fingerprint record item was written into the fingerprint record, the more likely it is that the data corresponding to that item has been overwritten by new data. The fingerprint record items written earlier can therefore be deleted to save the storage space occupied by the fingerprint record.
  • The fifth fingerprint record item in the fingerprint record can be deleted, where the fingerprint record has not recorded a predetermined number of fifth fingerprint record items within a predetermined time. In other words, the fingerprint record items that appear less frequently in the fingerprint record are deleted.
  • If a fingerprint record item appears infrequently within a predetermined time, the probability that the data corresponding to its fingerprint is stored repeatedly is small, so the fingerprint record item can be deleted directly to save the storage space occupied by the fingerprint record.
  • When the storage space occupied by the fingerprint record is greater than or equal to the first threshold, it can be determined whether the fingerprint record has recorded a predetermined number of third fingerprint record items containing the second fingerprint within a predetermined time. If the fingerprint record has not recorded the predetermined number of third fingerprint record items within the predetermined time, the stub of the second fingerprint is deleted from the fingerprint record.
  • If, after the stub of a fingerprint is recorded in the fingerprint record, few fingerprint record items containing that fingerprint are subsequently recorded, the stub has rarely been used to determine a duplicate fingerprint. That is, the stub contributes little to the determination of duplicate fingerprints, so it can be deleted to save the storage space occupied by the fingerprint record.
  • the first threshold, the second threshold, the predetermined number, and the predetermined time are not restricted.
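  • As a rough sketch only, two of the item-eviction rules described above (deleting unpaired items, and deleting items stored for at least the second threshold) might look like the following. The tuple layout `(fingerprint, token, written_at)`, the function name `evict_items`, and the `policy` argument are hypothetical names chosen for this illustration, and stubs are left out for brevity.

```python
from collections import Counter


def evict_items(items, used, first_threshold, policy, now=0.0, second_threshold=0.0):
    """Return the fingerprint record items that survive eviction.

    items: list of (fingerprint, token, written_at) tuples (hypothetical layout).
    used / first_threshold: space occupied by the record and its limit.
    policy: "singleton" drops items whose fingerprint appears only once;
            "age" drops items stored for at least second_threshold.
    """
    if used < first_threshold:
        # Eviction only starts once the record reaches the first threshold.
        return list(items)
    if policy == "singleton":
        counts = Counter(fp for fp, _, _ in items)
        return [it for it in items if counts[it[0]] > 1]
    if policy == "age":
        return [it for it in items if now - it[2] < second_threshold]
    raise ValueError(f"unknown policy: {policy}")
```

  • Only the fingerprint record items are removed in this sketch; the data they refer to is left untouched.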
  • a device for deleting duplicate data may be a storage server or a device in a storage server.
  • the device for deleting duplicate data includes a processor for implementing the method described in the first aspect.
  • the device for deleting duplicate data may also include a memory for storing program instructions and data.
  • the memory is coupled with the processor, and the processor can call and execute the program instructions stored in the memory to implement any one of the methods described in the first aspect.
  • the device for deleting duplicate data may further include a communication interface for communicating with other devices. Exemplarily, the other device is a client in the storage system.
  • the device for deleting duplicate data includes a processor and a communication interface, where:
  • the communication interface is used to obtain a fingerprint record.
  • the fingerprint record contains multiple fingerprint record items, and each fingerprint record item contains a fingerprint;
  • the processor is configured to determine at least two first fingerprint record items from the fingerprint record; wherein each first fingerprint record item includes a first fingerprint and a storage address of the data corresponding to the first fingerprint; the storage addresses of the data corresponding to the first fingerprint in the at least two first fingerprint record items are all different; and,
  • the stub of the first fingerprint is recorded in the fingerprint record; wherein the stub of the first fingerprint is used to indicate that the first fingerprint is a duplicate fingerprint.
  • the processor is also used to:
  • the second fingerprint record item contains the first fingerprint and the new storage address of the data corresponding to the first fingerprint; wherein the data corresponding to the first fingerprint in the second fingerprint record item is the newly written data;
  • the processor is also used to:
  • the processor is also used to:
  • the third fingerprint record item is deleted.
  • the fingerprint included in the third fingerprint record item is different from the fingerprints included in other fingerprint record items in the fingerprint record.
  • the processor is also used to:
  • the fourth fingerprint record item is deleted, and the duration of the fourth fingerprint record item stored in the fingerprint record is greater than or equal to the second threshold.
  • the processor is also used to:
  • the fifth fingerprint record item in the fingerprint record table is deleted, and the fingerprint record does not record a predetermined number of fifth fingerprint record items within a predetermined time.
  • the processor is also used to:
  • When the fingerprint record does not record the predetermined number of third fingerprint record items within the predetermined time, delete the stub of the second fingerprint in the fingerprint record; wherein the stub of the second fingerprint is used to indicate that the second fingerprint is a duplicate fingerprint; the third fingerprint record item contains the second fingerprint.
  • a device for deduplication of data can be a storage server or a device in a storage server.
  • the data deduplication device may include a processing module and a communication module. These modules may perform the corresponding functions in any of the design examples of the first aspect, specifically:
  • the communication module is used to obtain a fingerprint record.
  • the fingerprint record contains multiple fingerprint record items, and each fingerprint record item contains a fingerprint;
  • the processing module is configured to determine at least two first fingerprint record items from the fingerprint record; wherein each first fingerprint record item includes a first fingerprint and a storage address of the data corresponding to the first fingerprint; the storage addresses of the data corresponding to the first fingerprint in the at least two first fingerprint record items are all different; and,
  • the stub of the first fingerprint is recorded in the fingerprint record; wherein the stub of the first fingerprint is used to indicate that the first fingerprint is a duplicate fingerprint.
  • an embodiment of the present application also provides a computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the method in the first aspect or any one of the designs of the first aspect.
  • the embodiments of the present application also provide a computer program product, including instructions which, when run on a computer, cause the computer to execute the method in the first aspect or any one of the designs of the first aspect.
  • embodiments of the present application provide a chip system, which includes a processor and may also include a memory, for implementing the method in the first aspect and any one of the designs in the first aspect.
  • the chip system can be composed of chips, or can include chips and other discrete devices.
  • an embodiment of the present application provides a storage system that includes a storage device and the device for deleting duplicate data described in the second aspect and any one of the designs of the second aspect, or the storage system includes a storage device and the device for deleting duplicate data described in the third aspect and any one of the designs of the third aspect.
  • FIG. 1 is an example architecture diagram of a storage system provided by an embodiment of the application
  • FIG. 2 is a flowchart of a method for deleting duplicate data provided by an embodiment of the application
  • FIG. 3 is a schematic diagram of an example of fingerprint records before the deduplication operation and fingerprint records after the deduplication operation in an embodiment of the application;
  • FIGS. 4 to 5 are schematic diagrams of another example of fingerprint records before the data deduplication operation and fingerprint records after the data deduplication operation in an embodiment of the application;
  • FIGS. 6 to 10 are schematic diagrams of examples of deleting fingerprint record items according to fingerprint stubs in an embodiment of the application.
  • FIG. 11 is a structural diagram of an example of a device for deleting duplicate data provided in an embodiment of this application.
  • FIG. 12 is a structural diagram of another example of a device for deleting duplicate data provided in an embodiment of this application.
  • Deduplication technology can be divided into online deduplication mode and post deduplication mode according to the time when the deduplication operation is performed.
  • the online deduplication mode refers to performing the deduplication operation before storing the data in the cache of the storage system in the storage device, and then storing the data after the deduplication operation in the storage device.
  • the post-deduplication mode refers to calculating the fingerprint of the data in the cache, storing the data in the cache in the storage device, and recording the mapping between the fingerprint of the data and its storage address. In a preset time period (for example, when the storage system is idle), the mapping is read, the deduplication operation is performed on the data according to the fingerprint in the mapping, and the data after the deduplication operation is stored in the deduplication area of the storage device. It should be noted that the technical solution in the embodiment of the present application is an improvement on the post-deduplication mode.
  • the fingerprint table is used to record the mapping between the fingerprint of the unique data after deduplication and the storage address of the unique data in the deduplication area.
  • the deduplication area refers to the storage area in the storage system used to store the unique data after deduplication.
  • multiple refers to two or more than two. In view of this, “multiple” can also be understood as “at least two” in the embodiments of this application. "At least one" can be understood as one or more, for example, one, two or more. For example, including at least one refers to including one, two or more, and does not limit which ones are included. For example, including at least one of A, B, and C, then the included can be A, B, C, A and B, A and C, B and C, or A and B and C.
  • ordinal numbers such as “first” and “second” mentioned in the embodiments of the present application are used to distinguish multiple objects, and are not used to limit the order, timing, priority, or importance of multiple objects.
  • FIG. 1 is a structural diagram of an example of a storage system to which the method in the embodiment of this application is applicable.
  • a distributed storage system is taken as an example of the storage system.
  • the storage system 100 includes one server 110 and three storage nodes 120 (respectively storage node 1 to storage node 3). Each storage node 120 includes at least one storage device.
  • the storage device may include a serial advanced technology attachment (SATA) hard disk, a small computer system interface (SCSI) hard disk, a serial attached SCSI (SAS) hard disk, a fibre channel (FC) hard disk, a hard disk drive (HDD), a solid state drive (SSD), and the like.
  • Figure 2 is a flowchart of a method for deleting duplicate data.
  • the application of this method to the storage system shown in FIG. 1 will be taken as an example for description.
  • the description of the flowchart is as follows:
  • the storage system obtains a fingerprint record.
  • Each fingerprint record item contains a mapping between the fingerprint and the storage address of the data corresponding to the fingerprint.
  • the storage system receives data, calculates the fingerprint of the data, stores the data, generates fingerprint record items, and performs deduplication on the stored data in a preset time period (for example, when the storage system is idle).
  • the fingerprint record item contains the mapping between the fingerprint of the data and the storage address of the data.
  • the fingerprint record may be recorded in the form of a log, or the fingerprint record may be recorded in the form of an entry, which is not limited in the embodiment of the present application.
  • the fingerprint record item corresponding to the data includes three parts, namely a serial number, a fingerprint (fingerprint, FP), and a token (token).
  • the serial number can indicate the generation order of fingerprint record items
  • the token can indicate the storage address of the data and other information.
  • the number in the fingerprint record item is used as an exemplary implementation to indicate the sequence of the fingerprint record item.
  • the serial number may not be used, and the fingerprint record items are sorted based on the generation time.
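  • To make the three-part fingerprint record item concrete, a minimal sketch is given below. The class names `FingerprintRecordItem` and `FingerprintRecord`, and the choice of a simple in-memory list with a counter for serial numbers, are assumptions made for illustration; the application does not prescribe a layout.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class FingerprintRecordItem:
    serial: int   # generation order of the item
    fp: str       # fingerprint (FP) of the data
    token: str    # indicates the storage address of the data and other information


class FingerprintRecord:
    """Append-only list of fingerprint record items (hypothetical in-memory form)."""

    def __init__(self):
        self.items = []
        self._serials = count()

    def append(self, fp, token):
        # One item is generated per stored piece of data, even if the
        # fingerprint has been seen before at a different address.
        item = FingerprintRecordItem(next(self._serials), fp, token)
        self.items.append(item)
        return item
```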
  • the storage system acquiring fingerprint records specifically means that the server of the storage system acquires fingerprint records.
  • the fingerprint record can also be obtained by other devices or equipment.
  • in a storage array, the storage system obtaining the fingerprint record specifically means that the array controller of the storage array obtains the fingerprint record.
  • the storage system sorts the fingerprint records.
  • the storage system can sort the fingerprint record items in the fingerprint record in the order of FP in the fingerprint record items from small to large.
  • fingerprint record items with the same fingerprint are arranged together.
  • In FIG. 3(a) there are 5 different fingerprints, namely FP_0 to FP_4, including 3 occurrences of FP_1 and 4 occurrences of FP_4. After sorting by FP from small to large, the fingerprint record shown in FIG. 3(b) is obtained.
  • the storage system determines duplicate fingerprints from the fingerprint record.
  • the storage system determines the duplicate fingerprints from the fingerprint record according to a threshold for duplicate fingerprints. That is, the storage system determines, according to the sorted fingerprint record, whether the number of fingerprint record items containing the same fingerprint is greater than or equal to the threshold; if it is, the fingerprint is determined to be a duplicate fingerprint.
  • If a certain fingerprint is a duplicate fingerprint, the data stored at the storage addresses in the fingerprint record items containing that fingerprint is duplicate data.
  • the threshold may be 3.
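  • The sorting and threshold steps above can be sketched as follows, assuming items are plain `(fingerprint, token)` tuples and that fingerprints compare as strings; the function name `find_duplicate_fingerprints` is hypothetical. Sorting places items with the same fingerprint next to each other, so each group can be counted in a single pass.

```python
from itertools import groupby


def find_duplicate_fingerprints(items, threshold=3):
    """Sort items by fingerprint and return (sorted_items, duplicate_fingerprints).

    items: list of (fingerprint, token) tuples.
    A fingerprint is a duplicate when it occurs in at least `threshold` items.
    """
    ordered = sorted(items, key=lambda item: item[0])  # stable sort by FP
    duplicates = []
    for fp, group in groupby(ordered, key=lambda item: item[0]):
        if sum(1 for _ in group) >= threshold:
            duplicates.append(fp)
    return ordered, duplicates
```

  • With three items for FP_1, four for FP_4, and (as an assumption about the figure) one each for FP_0, FP_2 and FP_3, a threshold of 3 selects FP_1 and FP_4.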
  • the storage system performs data deduplication on the data corresponding to the fingerprint determined to be a duplicate fingerprint in the fingerprint record item.
  • FP_1 and FP_4 are duplicate fingerprints. That is, the fingerprint record contains 3 items with FP_1 (3 pieces of data whose fingerprint is FP_1) and 4 items with FP_4 (4 pieces of data whose fingerprint is FP_4). The data corresponding to FP_1 and to FP_4 is deduplicated respectively. On the one hand, the data corresponding to FP_1 and FP_4 in the fingerprint record is already duplicate data. The fingerprint table is then queried using FP_1 and FP_4.
  • the fingerprint FP_1 when the fingerprint FP_1 is found in the fingerprint table, it indicates that the unique data corresponding to the fingerprint FP_1 has been stored in the storage system.
  • the fingerprint table records the mapping of the storage address of the unique data corresponding to the fingerprint FP_1 and the fingerprint FP_1. Therefore, it is no longer necessary to store the data corresponding to FP_1 in the fingerprint record.
  • the storage system deletes the fingerprint record item including the duplicate fingerprint from the fingerprint record.
  • the fingerprint record items including duplicate fingerprints are deleted from the fingerprint record. For example, after the fingerprint record items containing FP_1 and FP_4 are deleted, the fingerprint record as shown in FIG. 3(c) is obtained. For other fingerprint record items, since the fingerprints in these fingerprint record items are not duplicate fingerprints, these fingerprint records are kept in the fingerprint record.
  • the data in the fingerprint record whose repetition number of fingerprints reaches the threshold is deduplicated, which improves the deduplication rate of the storage system.
  • However, once the fingerprint record items containing the fingerprints of this data have been deleted from the fingerprint record, data corresponding to these fingerprints may later be written into the storage system again. Since the fingerprint record no longer contains fingerprint record items for these fingerprints, the newly written data cannot be deduplicated, because the number of repetitions of its fingerprint does not reach the threshold.
  • the embodiment of the present application further includes:
  • the storage system records a stub of the fingerprint in the fingerprint record item that has been deleted in the fingerprint record.
  • the stub of the fingerprint in the deleted fingerprint record is used to indicate that the fingerprint in the deleted fingerprint record is a duplicate fingerprint.
  • The stubs corresponding to the three duplicate fingerprints, namely the stub of FP_1, the stub of FP_4, and the stub of FP_9, are added to the fingerprint record to obtain the fingerprint record shown in FIG. 4(b).
  • the stub of each fingerprint can be used as a record item, and the record item is indicated as a fingerprint stub by changing the information in the token to stub.
  • the token of the record item corresponding to FP_1 may be marked as stub_1, the token of the record item corresponding to FP_4 may be marked as stub_2, and the token of the record item corresponding to FP_9 may be marked as stub_3.
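  • A minimal sketch of this step, under the assumption that items are `(fingerprint, token)` tuples and that a stub is simply an item whose token is rewritten to `stub_N`, is shown below. The function name `replace_duplicates_with_stubs` and the numbering of stubs in fingerprint order are illustrative choices, not taken from the application.

```python
def replace_duplicates_with_stubs(items, duplicate_fps):
    """Delete the items of duplicate fingerprints and record one stub per fingerprint.

    items: list of (fingerprint, token) tuples.
    duplicate_fps: fingerprints already determined to be duplicates.
    """
    dup = set(duplicate_fps)
    kept = [(fp, token) for fp, token in items if fp not in dup]
    # A stub is a record item whose token marks it as a stub instead of
    # pointing at a storage address.
    stubs = [(fp, f"stub_{n}") for n, fp in enumerate(sorted(dup), start=1)]
    return kept + stubs
```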
  • the storage system records the new fingerprint record item in the fingerprint record.
  • the new fingerprint record item includes the fingerprint FP_1 and the new storage address of the data corresponding to the FP_1, and the data corresponding to the fingerprint FP_1 in the new fingerprint record item is newly written data.
  • the storage system receives the new data, calculates the fingerprint of the new data, stores the new data, and generates a fingerprint record item corresponding to the new data.
  • the storage system determines that the fingerprint in the new fingerprint record item is a duplicate fingerprint according to the stub of the fingerprint in the deleted fingerprint record item.
  • When a new fingerprint record item is recorded in the fingerprint record, the new fingerprint record item is compared with the stubs in the fingerprint record to determine whether the fingerprint in the new item is the same as the fingerprint corresponding to a stub. If they are the same, the fingerprint in the new fingerprint record item is determined to be a duplicate fingerprint; otherwise, the fingerprint is not a duplicate fingerprint, and the duplicate data deletion operation is performed only after the number of repetitions of the fingerprint reaches the threshold.
  • the fingerprint of the new data recorded in the new fingerprint record item is FP_1, which is the same as the fingerprint corresponding to the stub of fingerprint FP_1. Therefore, the fingerprint of the new data is a duplicate fingerprint.
  • The fingerprint corresponding to the new data can be compared with the stub. If the fingerprint corresponding to the new data is the same as the fingerprint indicated by the stub, the deduplication operation can be performed on the new data without waiting for the number of repetitions of its fingerprint to reach the threshold, which can improve the efficiency of the deduplication technology.
  • the storage system performs a data deduplication operation on the newly written data.
  • the fingerprint of the new data is a duplicate fingerprint, which means that the data has been stored in the storage device, so that the new data can be directly deduplicated.
  • the storage system deletes the new fingerprint record item.
  • the new fingerprint record item corresponding to the new data in the fingerprint record is deleted, thereby obtaining the fingerprint record as shown in FIG. 4(d).
  • the storage space occupied by the fingerprint records can be reduced, and the utilization rate of the storage space can be improved.
  • the new data may also be different from the data already stored in the storage system.
  • the new data also includes data 23, the fingerprint of the data 23 is calculated as FP_8, and the token corresponding to the service data 23 is token_23, and the fingerprint record as shown in FIG. 5(a) is obtained. Since the fingerprint record item corresponding to the service data 23 includes FP_8, which is different from the fingerprint in any fingerprint record item in the fingerprint record, the fingerprint included in the fingerprint record item corresponding to the service data 23 is not a duplicate fingerprint. Therefore, no deduplication operation is performed on the service data 23, and the fingerprint record item corresponding to the service data 23 is not deleted, thereby obtaining a fingerprint record as shown in FIG. 5(b).
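The stub check in steps S207 to S210 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the list-of-tuples record, the set of stub fingerprints, and the name `handle_new_item` are all assumptions made for the example, and the actual deduplication of the data (S209) is reduced to a comment.

```python
def handle_new_item(record, stubs, fp, token):
    """Record a new item and deduplicate it at once if a stub matches.

    record: list of (fingerprint, token) tuples (hypothetical layout)
    stubs:  set of fingerprints already known to be duplicates
    Returns True if the new data was treated as a duplicate.
    """
    record.append((fp, token))   # S207: record the new fingerprint record item
    if fp in stubs:              # S208: the stub marks this fingerprint as duplicate
        # S209: the data would be deduplicated here (omitted in this sketch).
        record.pop()             # S210: the new item is no longer needed
        return True
    return False                 # not a duplicate yet: keep the item and wait


record = [("FP_0", "token_1")]
stubs = {"FP_1", "FP_4", "FP_9"}
hit = handle_new_item(record, stubs, "FP_1", "token_22")   # matches the FP_1 stub
miss = handle_new_item(record, stubs, "FP_8", "token_23")  # no stub for FP_8
```

After these two calls the record holds the original item plus the FP_8 item, which mirrors the outcome described for data 23.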
  • S211: When the storage space occupied by the fingerprint record is greater than or equal to a first threshold, the storage system deletes some fingerprint record items.
  • The fingerprint record is stored in the deduplication metadata space. Because this space is limited, as more and more data is written into the storage system, the storage space occupied by the fingerprint record may exceed the first threshold. The first threshold may be, for example, 80% or 70% of the maximum size of the deduplication metadata space. If the storage space occupied by the fingerprint record exceeds the first threshold (see FIG. 6(a)), some fingerprint record items in the fingerprint record need to be deleted, which can also be understood as evicting some fingerprint record items. Note that evicting or deleting fingerprint record items means that only the fingerprint record items are processed; the data corresponding to them is not.
  • Deleting some of the fingerprint record items may include, but is not limited to, the following methods.
  • Method 1: A third fingerprint record item is deleted.
  • The fingerprint included in the third fingerprint record item is different from the fingerprints included in all other fingerprint record items in the fingerprint record. In other words, singleton fingerprint record items are deleted from the fingerprint record.
  • It takes a certain period of time for the storage space occupied by the fingerprint record to reach the first threshold. If a fingerprint record item remains a singleton throughout this period, the probability that the data corresponding to its fingerprint is stored repeatedly is very small, and the item would have to wait even longer before a deduplication operation could be performed on it. The item can therefore be deleted directly to save the storage space occupied by the fingerprint record.
  • For example, in FIG. 6(a), the fingerprint record items corresponding to FP_0, FP_6, FP_7, FP_8, FP_10, and so on all contain singleton fingerprints, so these fingerprint record items are deleted, giving the fingerprint record shown in FIG. 6(b).
  • Method 2: A fourth fingerprint record item is deleted, where the length of time the fourth fingerprint record item has been stored in the fingerprint record is greater than or equal to a second threshold. In other words, the fingerprint record items that were written earliest are deleted.
  • The earlier a fingerprint record item was written into the fingerprint record, the more likely it is that the data corresponding to it has been overwritten by new data. Data that has been overwritten is no longer stored repeatedly in the storage system, so there is no need to deduplicate it. The earlier-written fingerprint record items can therefore be deleted to save the storage space occupied by the fingerprint record.
  • The length of time a fingerprint record item has been stored in the fingerprint record can be determined from the value of its token. For example, if the storage system writes data sequentially, a smaller storage address means that the data, and hence its fingerprint record item, has been stored longer.
  • The second threshold may be expressed as a difference from the maximum token value among the fingerprint record items; the difference may be 20, 15, or the like. Taking a difference of 20 as an example: in FIG. 7(a), the maximum token value is 31, so the fingerprint record items whose token values are 1 to 11 are deleted, giving the fingerprint record shown in FIG. 7(b).
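The token-based age rule can be checked with a small sketch. It assumes, purely for illustration, that each record item is a `(fingerprint, token_value)` tuple with an integer token; `evict_old` and `max_gap` are hypothetical names.

```python
def evict_old(record, max_gap=20):
    """Keep only items whose token is within `max_gap` of the newest token.

    record: list of (fingerprint, token_value) tuples with integer tokens.
    An item is evicted when newest_token - token_value >= max_gap.
    """
    if not record:
        return []
    newest = max(token for _, token in record)
    return [(fp, token) for fp, token in record if newest - token < max_gap]


# Tokens 1..31, as in the FIG. 7(a) example (the fingerprints are made up).
record = [(f"FP_{i}", i) for i in range(1, 32)]
kept = evict_old(record, max_gap=20)
kept_tokens = [token for _, token in kept]   # tokens 12..31 remain
```

With a maximum token of 31 and a difference of 20, the items with tokens 1 to 11 satisfy `31 - token >= 20` and are removed, which agrees with the example in the text.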
  • Method 3: A fifth fingerprint record item is deleted, where the fingerprint record has not recorded a predetermined number of fifth fingerprint record items within a predetermined time. In other words, fingerprint record items that appear infrequently in the fingerprint record are deleted.
  • If a fingerprint record item appears only a few times within the predetermined time, the probability that the data corresponding to its fingerprint is stored repeatedly is low, so the item can be deleted directly to save the storage space occupied by the fingerprint record.
  • For example, the predetermined number may be 1 (or 2); that is, the fingerprint record items corresponding to fingerprints that appear no more than once (or twice) in the fingerprint record are deleted.
  • When the predetermined number is 1, the result is the same as in Method 1.
  • When the predetermined number is 2, the process is analogous to Method 1 and is not repeated here.
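Methods 1 and 3 differ only in the occurrence count they tolerate, so one hedged sketch can cover both. The tuple layout, the `stub_` token prefix used to recognise stubs, and the name `evict_rare` are assumptions for this example; the text does not specify how stubs are excluded from eviction.

```python
from collections import Counter


def evict_rare(record, max_count=1):
    """Drop ordinary items whose fingerprint occurs at most `max_count` times.

    record: list of (fingerprint, token) tuples; stub entries are assumed to
    carry a token starting with "stub_" and are always kept.
    max_count=1 corresponds to Method 1 (singletons); 2 to the variant of Method 3.
    """
    counts = Counter(fp for fp, token in record if not token.startswith("stub_"))
    return [
        (fp, token)
        for fp, token in record
        if token.startswith("stub_") or counts[fp] > max_count
    ]


record = [
    ("FP_0", "token_1"),
    ("FP_2", "token_2"),
    ("FP_2", "token_3"),
    ("FP_1", "stub_1"),
    ("FP_6", "token_4"),
]
after_method_1 = evict_rare(record)               # FP_0 and FP_6 are singletons
after_method_3 = evict_rare(record, max_count=2)  # FP_2 (two occurrences) goes too
```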
  • Method 4: If the fingerprint record contains a stub of a second fingerprint, where the stub indicates that the second fingerprint is a duplicate fingerprint, it is determined whether the fingerprint record has recorded a predetermined number of third fingerprint record items containing the second fingerprint within a predetermined time.
  • If the fingerprint record has not recorded the predetermined number of third fingerprint record items within the predetermined time, the stub of the second fingerprint is deleted from the fingerprint record.
  • If, after the stub of a fingerprint is recorded, few subsequent fingerprint record items with that fingerprint are recorded, the stub has been used to identify few duplicate fingerprints. In other words, the stub contributes little to identifying duplicate fingerprints, so it can be deleted to save the storage space occupied by the fingerprint record.
  • For example, the record item corresponding to a stub can record how many fingerprint record items containing the stub's fingerprint were recorded within a preset time.
  • For instance, a number parameter is added to the token, and its value is the number of fingerprint record items containing the stub's fingerprint that were recorded within the preset time.
  • A number value of 3 means that fingerprint record items containing the stub's fingerprint were recorded 3 times within the preset time; see FIG. 8(a).
  • The value of the number parameter is reset to zero at the end of every preset time period, which can be 5 s or 10 s. If the predetermined number is 3, a stub record item whose number value is less than 3 can be deleted, giving the fingerprint record shown in FIG. 8(b).
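A counter-based stub table along these lines might look like the sketch below. It is a simplification under stated assumptions: the class `StubTable` and its methods are invented for illustration, and eviction is folded into the end-of-window reset, whereas the text triggers deletion only when the fingerprint record exceeds the first threshold.

```python
class StubTable:
    """Tracks how often each stub identified a duplicate in the current window."""

    def __init__(self, min_hits=3):
        self.min_hits = min_hits   # the "predetermined number"
        self.hits = {}             # fingerprint -> value of the number parameter

    def add_stub(self, fp):
        self.hits[fp] = 0

    def record_hit(self, fp):
        """Call when a new record item with fingerprint `fp` matched a stub."""
        if fp in self.hits:
            self.hits[fp] += 1

    def end_window(self):
        """Drop stubs with fewer than `min_hits` hits and reset the counters."""
        self.hits = {fp: 0 for fp, n in self.hits.items() if n >= self.min_hits}


table = StubTable(min_hits=3)
table.add_stub("FP_1")
table.add_stub("FP_4")
for _ in range(3):
    table.record_hit("FP_1")   # FP_1 is seen three times in the window
for _ in range(2):
    table.record_hit("FP_4")   # FP_4 is seen only twice
table.end_window()             # FP_4's stub falls below the predetermined number
```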
  • As another example, the record item corresponding to a stub can record the last time the stub was used to identify a duplicate fingerprint, such as the sorting pass in which this happened.
  • Method 5: Any two or more of Methods 1 to 4 can be combined. For example, combining Method 1 and Method 2: in FIG. 10(a), the fingerprint record items corresponding to FP_0, FP_6, FP_7, FP_8, FP_10, FP_13, and FP_16 to FP_18 all contain singleton fingerprints.
  • However, the fingerprint record items corresponding to FP_10, FP_13, and FP_16 to FP_18 have been stored for only a short time (because they were written later). Therefore, only the fingerprint record items corresponding to FP_0, FP_6, FP_7, and FP_8 are deleted, and those corresponding to FP_10, FP_13, and FP_16 to FP_18 are retained.
  • The resulting fingerprint record is shown in FIG. 10(b).
  • In addition, after determining that the storage space occupied by the fingerprint record is greater than or equal to the first threshold, the storage server may first determine how many fingerprint record items need to be deleted and then delete that number of items from the fingerprint record. For example, if every fingerprint record item occupies the same amount of space, the storage server can determine the maximum number of items the fingerprint record can hold, such as 30.
  • When the number of fingerprint record items reaches 33, it can be determined that 3 items need to be deleted, and the 3 items are selected according to any one of the five methods above. Once 3 items that meet the condition have been found, they can be deleted without traversing the entire fingerprint record, which can improve efficiency.
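The early-stopping selection described above can be sketched with a lazy scan. Everything named here (`pick_victims`, `is_evictable`, the integer tokens) is an assumption for illustration; the predicate stands in for whichever of the eviction criteria is applied, and an age test is used because it can be evaluated per item.

```python
from itertools import islice


def pick_victims(record, capacity, is_evictable):
    """Return just enough evictable items to bring the record back to capacity.

    The generator is consumed lazily, so scanning stops as soon as the
    required number of items has been found.
    """
    excess = len(record) - capacity
    if excess <= 0:
        return []
    candidates = (item for item in record if is_evictable(item))
    return list(islice(candidates, excess))


inspected = []


def is_old(item):
    """Hypothetical criterion: token values up to 11 count as old."""
    inspected.append(item)
    return item[1] <= 11


# 33 items in a record that can hold 30, as in the example above.
record = [(f"FP_{i}", i) for i in range(1, 34)]
victims = pick_victims(record, capacity=30, is_evictable=is_old)
```

In this run only the first three items are inspected before the scan stops, which is the efficiency gain the text refers to.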
  • Fingerprint record items to be deleted can also be determined in other ways, which are not listed one by one here.
  • In the foregoing technical solution, because a stub corresponding to each duplicate fingerprint is added to the fingerprint record, the stub can be used to determine directly that the fingerprint in a fingerprint record item is a duplicate fingerprint, instead of waiting, as in the prior art, until the fingerprint has been repeated a certain number of times. Duplicate fingerprints can thus be identified quickly and the corresponding data deduplicated, which can improve the efficiency of the deduplication technology.
  • To implement the functions in the methods provided in the embodiments of this application, the storage system may include a hardware structure and/or a software module, and may implement the functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether a given function is executed by a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
  • FIG. 11 shows a schematic structural diagram of a data deduplication apparatus 1100.
  • The data deduplication apparatus 1100 can be used to implement the function of the server of a distributed storage system, and can also be used to implement the function of the array controller in a storage array.
  • The data deduplication apparatus 1100 may be a hardware structure, a software module, or a hardware structure plus a software module.
  • The data deduplication apparatus 1100 may be implemented by a chip system. In the embodiments of this application, the chip system may consist of chips, or may include chips and other discrete devices.
  • The data deduplication apparatus 1100 may include a processing module 1101 and a communication module 1102.
  • The processing module 1101 may be used to execute steps S201 to S211 in the embodiment shown in FIG. 2 and/or to support other processes of the technology described herein.
  • The communication module 1102 may be used to support obtaining data in the embodiment shown in FIG. 2, and/or to support other processes of the technology described herein.
  • The communication module 1102 is used by the data deduplication apparatus 1100 to communicate with other modules, and it may be a circuit, a device, an interface, a bus, a software module, a transceiver, or any other apparatus that can implement communication.
  • The division of modules in the embodiment shown in FIG. 11 is illustrative and is only a logical function division. In actual implementation, there may be other division methods.
  • In addition, the functional modules in the embodiments of this application may be integrated in one processor, may exist alone physically, or two or more modules may be integrated into one module.
  • The integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • FIG. 12 shows a data deduplication apparatus 1200, which can be used to implement the function of the server of a distributed storage system, and can also be used to implement the function of the array controller in a storage array.
  • The data deduplication apparatus 1200 may be a chip system.
  • The chip system may consist of chips, or may include chips and other discrete devices.
  • The data deduplication apparatus 1200 includes at least one processor 1220, which is configured to implement, or to support the apparatus 1200 in implementing, the function of the storage server in the method provided in the embodiments of this application.
  • For example, the processor 1220 may perform a data deduplication operation on newly written data. For details, refer to the detailed description in the method example, which is not repeated here.
  • The data deduplication apparatus 1200 may further include at least one memory 1230 for storing program instructions and/or data.
  • The memory 1230 and the processor 1220 are coupled.
  • Coupling in the embodiments of this application is an indirect coupling or communication connection between apparatuses, units, or modules. It may be electrical, mechanical, or of another form, and is used for information exchange between the apparatuses, units, or modules.
  • The processor 1220 may operate in cooperation with the memory 1230.
  • The processor 1220 may execute program instructions stored in the memory 1230. At least one of the at least one memory may be included in the processor.
  • The data deduplication apparatus 1200 may further include a communication interface 1210 for communicating with other devices through a transmission medium.
  • For example, the other device may be a storage client or a storage device.
  • The processor 1220 may use the communication interface 1210 to send and receive data.
  • The specific connection medium between the communication interface 1210, the processor 1220, and the memory 1230 is not limited in the embodiments of this application.
  • In FIG. 12, the memory 1230, the processor 1220, and the communication interface 1210 are connected by a bus 1240.
  • The bus is represented by a thick line in FIG. 12. The connection mode between other components is shown only for illustration and is not limiting.
  • The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 12, but this does not mean that there is only one bus or one type of bus.
  • The processor 1220 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this application may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
  • The memory 1230 may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as random-access memory (RAM).
  • The memory may also be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • The memory in the embodiments of this application may also be a circuit or any other apparatus capable of implementing a storage function, for storing program instructions and/or data.
  • An embodiment of this application also provides a computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the method executed by the storage server in the embodiment shown in FIG. 2.
  • An embodiment of this application also provides a computer program product, including instructions which, when run on a computer, cause the computer to execute the method executed by the storage server in the embodiment shown in FIG. 2.
  • An embodiment of this application provides a chip system.
  • The chip system includes a processor and may also include a memory, and is used to implement the function of the storage server in the foregoing method.
  • The chip system may consist of chips, or may include chips and other discrete devices.
  • An embodiment of this application provides a storage system, which includes a storage device and the storage server in the embodiment shown in FIG. 2.
  • The methods provided in the embodiments of this application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented by software, the methods can be implemented in whole or in part in the form of a computer program product.
  • The computer program product includes one or more computer instructions.
  • The computer may be a general-purpose computer, a dedicated computer, a computer network, a network device, user equipment, or another programmable apparatus.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) means.
  • A computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media.
  • The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, an SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

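A data deduplication method and apparatus. In the method, a fingerprint record containing multiple fingerprint record items is first obtained, and at least two first fingerprint record items that include the same fingerprint are then identified in the fingerprint record; for example, each of the at least two first fingerprint record items includes a first fingerprint. A deduplication operation is performed on the data corresponding to the first fingerprint in the at least two first fingerprint record items, the at least two first fingerprint record items are deleted, and a stub of the first fingerprint is recorded in the fingerprint record, where the stub indicates that the first fingerprint is a duplicate fingerprint. Because a stub corresponding to the duplicate fingerprint is added to the fingerprint record, the stub can be used to determine directly that the fingerprint included in a fingerprint record item is a duplicate fingerprint, so duplicate fingerprints can be identified quickly and the deduplication operation performed, which can improve the efficiency of the deduplication technology.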

Description

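Data deduplication method and apparatus
Technical Field
This application relates to the field of storage technologies, and in particular, to a data deduplication method and apparatus.
Background
With the development of technology, more and more data needs to be stored in storage systems. To save storage space, deduplication technology has been proposed: if multiple copies of a piece of data are stored in the storage system, the redundant copies are deleted and only one copy is kept, so that the storage space occupied by the data is reduced.
At present, one implementation of deduplication is as follows. First, the fingerprint of each piece of data is calculated, the data is stored, and a mapping between the fingerprint and the storage address of the data is recorded. The stored data is then deduplicated in batches. Batch deduplication of the stored data includes querying whether the fingerprint table contains the same fingerprint as the stored data; if it does, the data is judged to be duplicate data, and otherwise it is considered unique data. The mapping between the fingerprint of the earlier data and its storage address is also deleted. It can be seen that the current deduplication technology must look up the fingerprints of all data to be deduplicated in the fingerprint table before it can determine whether the data is duplicate, which makes deduplication inefficient.
Summary
This application provides a data deduplication method and apparatus to improve the efficiency of deduplication technology.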
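According to a first aspect, a data deduplication method is provided. In the method, a fingerprint record containing multiple fingerprint record items is first obtained, where each fingerprint record item contains a fingerprint and the storage address of the data corresponding to the fingerprint. If two pieces of data are identical but stored at different storage addresses, a separate fingerprint record item is generated for each of them; the two items include the same fingerprint but different storage addresses. After the fingerprint record is obtained, at least two first fingerprint record items that include the same fingerprint are identified in it; for example, each of the at least two first fingerprint record items includes a first fingerprint. A deduplication operation is then performed on the data corresponding to the first fingerprint in the at least two first fingerprint record items, the at least two first fingerprint record items are deleted, and a stub of the first fingerprint is recorded in the fingerprint record, where the stub indicates that the first fingerprint is a duplicate fingerprint.
In the foregoing technical solution, because a stub corresponding to the duplicate fingerprint is added to the fingerprint record, the stub can be used to determine directly that the fingerprint included in a fingerprint record item is a duplicate fingerprint. Unlike the prior art, there is no need to look up the fingerprints of all data to be deduplicated in the fingerprint table before determining whether the data is duplicate. This application can therefore identify duplicate fingerprints quickly and deduplicate the corresponding data, which can improve the efficiency of deduplication technology.
In a possible design, when new data is written into the storage system, a fingerprint record item corresponding to the new data is recorded in the fingerprint record. As an example, this item is referred to as a second fingerprint record item, and it contains the first fingerprint and the storage address of the new data. Because the stub of the first fingerprint indicates that the first fingerprint is a duplicate fingerprint, and the second fingerprint record item includes the first fingerprint, the first fingerprint in the second fingerprint record item is determined to be a duplicate fingerprint, and a deduplication operation is performed on the new data.
In the foregoing technical solution, after new data is stored in the storage system, the fingerprint corresponding to the new data can be compared with the stub. If it is the same as the fingerprint indicated by the stub, the new data can be deduplicated without querying the fingerprint table. Skipping the fingerprint table query can improve the efficiency of deduplication technology.
In a possible design, after the newly written data corresponding to the second fingerprint record item is deleted, the second fingerprint record item can be deleted.
In the foregoing technical solution, deleting invalid fingerprint record items reduces the storage space occupied by the fingerprint record and improves storage space utilization.
In a possible design, when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, a third fingerprint record item in the fingerprint record can be deleted, where the fingerprint included in the third fingerprint record item is different from the fingerprints included in all other fingerprint record items in the fingerprint record. In other words, singleton fingerprint record items are deleted from the fingerprint record.
In the foregoing technical solution, as more and more data is written into the storage system, the fingerprint record occupies more and more storage space, and it takes a certain period of time for the occupied space to reach the first threshold. If a fingerprint record item remains a singleton throughout this period, the probability that the data corresponding to its fingerprint is stored repeatedly is very small, and the item would have to wait even longer before a deduplication operation could be performed on it. The item can therefore be deleted directly to save the storage space occupied by the fingerprint record.
In a possible design, when the storage space occupied by the fingerprint record is greater than or equal to the first threshold, a fourth fingerprint record item can be deleted, where the length of time the fourth fingerprint record item has been stored in the fingerprint record is greater than or equal to a second threshold. In other words, the fingerprint record items that were written earliest are deleted.
In the foregoing technical solution, data that has been overwritten is no longer stored repeatedly in the storage system, so there is no need to deduplicate it. The earlier a fingerprint record item was written into the fingerprint record, the more likely it is that the data corresponding to it has been overwritten by new data. The earlier-written fingerprint record items can therefore be deleted to save the storage space occupied by the fingerprint record.
In a possible design, when the storage space occupied by the fingerprint record is greater than or equal to the first threshold, a fifth fingerprint record item in the fingerprint record can be deleted, where the fingerprint record has not recorded a predetermined number of fifth fingerprint record items within a predetermined time. In other words, fingerprint record items that appear infrequently in the fingerprint record are deleted.
In the foregoing technical solution, if a fingerprint record item appears only a few times within the predetermined time, the probability that the data corresponding to its fingerprint is stored repeatedly is low, so the item can be deleted directly to save the storage space occupied by the fingerprint record.
In a possible design, if the fingerprint record contains a stub of a second fingerprint, where the stub indicates that the second fingerprint is a duplicate fingerprint, then when the storage space occupied by the fingerprint record is greater than or equal to the first threshold, it can be determined whether the fingerprint record has recorded a predetermined number of third fingerprint record items containing the second fingerprint within a predetermined time. If the fingerprint record has not recorded the predetermined number of third fingerprint record items within the predetermined time, the stub of the second fingerprint is deleted from the fingerprint record.
In the foregoing technical solution, if few fingerprint record items with a given fingerprint are recorded after the stub of that fingerprint is recorded, the stub has been used to identify few duplicate fingerprints. In other words, the stub contributes little to identifying duplicate fingerprints, so it can be deleted to save the storage space occupied by the fingerprint record.
The embodiments of this application place no limitation on the first threshold, the second threshold, the predetermined number, or the predetermined time.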
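According to a second aspect, a data deduplication apparatus is provided. The apparatus may be a storage server or an apparatus in a storage server. The apparatus includes a processor configured to implement the method described in the first aspect. The apparatus may further include a memory for storing program instructions and data. The memory is coupled to the processor, and the processor can invoke and execute the program instructions stored in the memory to implement any of the methods described in the first aspect. The apparatus may further include a communication interface used by the apparatus to communicate with other devices. For example, the other device is a client in the storage system.
In a possible design, the data deduplication apparatus includes a processor and a communication interface, where:
the communication interface is configured to obtain a fingerprint record, where the fingerprint record contains multiple fingerprint record items and each fingerprint record item contains a fingerprint;
the processor is configured to identify at least two first fingerprint record items in the fingerprint record, where each first fingerprint record item contains a first fingerprint and the storage address of the data corresponding to the first fingerprint, and the storage addresses of the data corresponding to the first fingerprint in the at least two first fingerprint record items are all different; and
to perform a deduplication operation on the data corresponding to the first fingerprint in the at least two first fingerprint record items; and
to delete the at least two first fingerprint record items; and
to record a stub of the first fingerprint in the fingerprint record, where the stub of the first fingerprint is used to indicate that the first fingerprint is a duplicate fingerprint.
In a possible design, the processor is further configured to:
record a second fingerprint record item in the fingerprint record, where the second fingerprint record item contains the first fingerprint and a new storage address of the data corresponding to the first fingerprint, and the data corresponding to the first fingerprint in the second fingerprint record item is newly written data;
determine, according to the stub of the first fingerprint, that the first fingerprint in the second fingerprint record item is a duplicate fingerprint; and
perform a deduplication operation on the newly written data.
In a possible design, the processor is further configured to:
delete the second fingerprint record item.
In a possible design, the processor is further configured to:
when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, delete a third fingerprint record item, where the fingerprint included in the third fingerprint record item is different from the fingerprints included in all other fingerprint record items in the fingerprint record.
In a possible design, the processor is further configured to:
when the storage space occupied by the fingerprint record is greater than or equal to the first threshold, delete a fourth fingerprint record item, where the length of time the fourth fingerprint record item has been stored in the fingerprint record is greater than or equal to a second threshold.
In a possible design, the processor is further configured to:
when the storage space occupied by the fingerprint record is greater than or equal to the first threshold, delete a fifth fingerprint record item in the fingerprint record, where the fingerprint record has not recorded a predetermined number of fifth fingerprint record items within a predetermined time.
In a possible design, the processor is further configured to:
when the storage space occupied by the fingerprint record is greater than or equal to the first threshold, determine whether the fingerprint record has recorded a predetermined number of third fingerprint record items within a predetermined time; and
when the fingerprint record has not recorded the predetermined number of third fingerprint record items within the predetermined time, delete the stub of a second fingerprint from the fingerprint record, where the stub of the second fingerprint is used to indicate that the second fingerprint is a duplicate fingerprint, and the third fingerprint record items contain the second fingerprint.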
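According to a third aspect, a data deduplication apparatus is provided. The apparatus may be a storage server or an apparatus in a storage server. The apparatus may include a processing module and a communication module, which can perform the corresponding functions in any design example of the first aspect. Specifically:
the communication module is configured to obtain a fingerprint record, where the fingerprint record contains multiple fingerprint record items and each fingerprint record item contains a fingerprint;
the processing module is configured to identify at least two first fingerprint record items in the fingerprint record, where each first fingerprint record item contains a first fingerprint and the storage address of the data corresponding to the first fingerprint, and the storage addresses of the data corresponding to the first fingerprint in the at least two first fingerprint record items are all different; and
to perform a deduplication operation on the data corresponding to the first fingerprint in the at least two first fingerprint record items; and
to delete the at least two first fingerprint record items; and
to record a stub of the first fingerprint in the fingerprint record, where the stub of the first fingerprint is used to indicate that the first fingerprint is a duplicate fingerprint.
According to a fourth aspect, an embodiment of this application further provides a computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the method in the first aspect or any design of the first aspect.
According to a fifth aspect, an embodiment of this application further provides a computer program product, including instructions which, when run on a computer, cause the computer to execute the method in the first aspect or any design of the first aspect.
According to a sixth aspect, an embodiment of this application provides a chip system. The chip system includes a processor and may also include a memory, and is used to implement the method in the first aspect or any design of the first aspect. The chip system may consist of chips, or may include chips and other discrete devices.
According to a seventh aspect, an embodiment of this application provides a storage system. The storage system includes a storage device and the data deduplication apparatus described in the second aspect or any design of the second aspect, or the storage system includes a storage device and the data deduplication apparatus described in the third aspect or any design of the third aspect.
For the beneficial effects of the second to sixth aspects and their implementations, refer to the description of the beneficial effects of the method of the first aspect and its implementations.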
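Brief Description of Drawings
FIG. 1 is an architecture diagram of an example of a storage system according to an embodiment of this application;
FIG. 2 is a flowchart of a data deduplication method according to an embodiment of this application;
FIG. 3 is a schematic diagram of an example of the fingerprint record before and after the deduplication operation in an embodiment of this application;
FIG. 4 and FIG. 5 are schematic diagrams of another example of the fingerprint record before and after the deduplication operation in an embodiment of this application;
FIG. 6 to FIG. 10 are schematic diagrams of examples of deleting fingerprint record items according to fingerprint stubs in an embodiment of this application;
FIG. 11 is a structural diagram of an example of a data deduplication apparatus according to an embodiment of this application;
FIG. 12 is a structural diagram of another example of a data deduplication apparatus according to an embodiment of this application.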
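Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the embodiments are described in further detail below with reference to the accompanying drawings.
The technical terms used in this application are explained below to help a person skilled in the art understand the technical solutions of this application.
1) Deduplication technology can be divided, according to when the deduplication operation is performed, into inline deduplication and post-process deduplication. In inline deduplication, the deduplication operation is performed on the data in the cache of the storage system before the data is stored in the storage device, and the deduplicated data is then stored in the storage device. In post-process deduplication, the fingerprint of the data in the cache is calculated, and after the data in the cache is stored in the storage device, a mapping between the fingerprint of the data and its storage address is recorded. During a preset time period (for example, when the storage system is idle), the mapping is read, the data is deduplicated according to the fingerprints in the mapping, and the deduplicated data is stored in the deduplication region of the storage device. Note that the technical solutions in the embodiments of this application are improvements to post-process deduplication.
2) The fingerprint table records the mapping between the fingerprint of each piece of unique data after deduplication and the storage address of that unique data in the deduplication region. The deduplication region is the storage area in the storage system used to store the unique data after deduplication.
3) In the embodiments of this application, "multiple" means two or more, and can therefore also be understood as "at least two". "At least one" can be understood as one or more, for example one, two, or more. For example, "including at least one" means including one, two, or more, without limiting which ones are included. For example, if at least one of A, B, and C is included, what is included may be A, B, C, A and B, A and C, B and C, or A and B and C. "And/or" describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean that A exists alone, that both A and B exist, or that B exists alone. In addition, unless otherwise specified, the character "/" generally indicates an "or" relationship between the associated objects before and after it.
Unless stated to the contrary, ordinal numbers such as "first" and "second" in the embodiments of this application are used to distinguish between multiple objects, and are not used to limit the order, timing, priority, or importance of the objects.
The data deduplication method provided in the embodiments of this application is described below with reference to the accompanying drawings.
FIG. 1 is an architecture diagram of an example of a storage system to which the method in the embodiments of this application applies. In FIG. 1, the storage system is a distributed storage system by way of example.
In FIG. 1, the storage system 100 includes one server 110 and three storage nodes 120 (storage node 1 to storage node 3). Each storage node 120 includes at least one storage device. The storage devices may include, for example, serial advanced technology attachment (SATA) hard disks, small computer system interface (SCSI) hard disks, serial attached SCSI (SAS) hard disks, fibre channel (FC) hard disks, hard disk drives (HDD), and solid state drives (SSD).
FIG. 2 is a flowchart of the data deduplication method. In the following description, the method is applied to the storage system shown in FIG. 1 by way of example. The flowchart is described as follows: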
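S201: The storage system obtains a fingerprint record.
Each fingerprint record item contains a mapping between a fingerprint and the storage address of the data corresponding to the fingerprint. In post-process deduplication, the storage system receives data, calculates the fingerprint of the data, stores the data, generates a fingerprint record item, and deduplicates the stored data during a preset time period (for example, when the storage system is idle).
In a specific implementation, the fingerprint record may be kept in the form of a log or in the form of table entries, which is not limited in the embodiments of this application.
As an example, refer to FIG. 3. Assume that the storage system stores 10 pieces of data and the fingerprint record contains the fingerprint record items of these 10 pieces of data, as shown in FIG. 3(a). In FIG. 3(a), the fingerprint record item corresponding to a piece of data has three parts: a number, a fingerprint (FP), and a token. The number can indicate the order in which the fingerprint record items were generated, and the token indicates information such as the storage address of the data. In the embodiments of this application, the number in a fingerprint record item is one example implementation for representing the order of the fingerprint record items. In another implementation, the number may be omitted and the items ordered by their generation time.
Note that when the storage system is a distributed storage system, the fingerprint record is obtained by the server of the storage system. In other scenarios, the fingerprint record may be obtained by another apparatus or device. For example, when the storage system is a storage array, the fingerprint record is obtained by the array controller of the storage array.
S202: The storage system sorts the fingerprint record.
Specifically, the storage system can sort the fingerprint record items in the fingerprint record in ascending order of FP, so that fingerprint record items with the same fingerprint are placed next to each other. For example, FIG. 3(a) includes 5 different fingerprints, FP_0 to FP_4, among which there are 3 occurrences of FP_1 and 4 occurrences of FP_4. After sorting in ascending order of FP, the fingerprint record shown in FIG. 3(b) is obtained.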
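S203: The storage system identifies duplicate fingerprints in the fingerprint record.
The storage system identifies duplicate fingerprints in the fingerprint record according to a duplicate-fingerprint threshold. Based on the sorted fingerprint record, the storage system determines whether the number of fingerprint record items containing the same fingerprint is greater than or equal to the threshold; if so, that fingerprint is determined to be a duplicate fingerprint.
If a fingerprint is a duplicate fingerprint, the data stored at the storage addresses in the fingerprint record items containing that fingerprint is duplicate data.
As an example, the threshold may be 3. In the fingerprint record shown in FIG. 3(b), 3 fingerprint record items include FP_1 and 4 fingerprint record items include FP_4, so FP_1 and FP_4 are determined to be duplicate fingerprints.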
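Steps S202 and S203 can be sketched as a sort followed by a grouped count. The `Item` tuple, the function name, and the concrete ordering of the ten items are assumptions for illustration; only the counts (three FP_1, four FP_4, and three other fingerprints) come from the FIG. 3(a) description. Fingerprints are compared as plain strings here, which is enough to bring equal fingerprints together.

```python
from itertools import groupby
from typing import NamedTuple


class Item(NamedTuple):
    seq: int     # generation order (the "number" column)
    fp: str      # fingerprint, e.g. "FP_1"
    token: str   # stands in for the storage address


def find_duplicate_fingerprints(items, threshold=3):
    """Sort by fingerprint (S202), then collect fingerprints that occur
    at least `threshold` times (S203)."""
    ordered = sorted(items, key=lambda item: item.fp)
    duplicates = {}
    for fp, group in groupby(ordered, key=lambda item: item.fp):
        members = list(group)
        if len(members) >= threshold:
            duplicates[fp] = members
    return ordered, duplicates


fps = ["FP_1", "FP_4", "FP_0", "FP_4", "FP_2",
       "FP_1", "FP_4", "FP_3", "FP_1", "FP_4"]
items = [Item(i, fp, f"token_{i}") for i, fp in enumerate(fps, start=1)]
ordered, duplicates = find_duplicate_fingerprints(items, threshold=3)
```

Because `sorted` is stable, items with the same fingerprint stay in generation order inside each group.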
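S204: The storage system deduplicates the data corresponding to the fingerprints in the fingerprint record items that were determined to be duplicate fingerprints.
The fingerprint record in FIG. 3(b) is again used as an example, where FP_1 and FP_4 are duplicate fingerprints. That is, the fingerprint record contains three FP_1 entries (three pieces of data whose fingerprint is FP_1) and four FP_4 entries (four pieces of data whose fingerprint is FP_4). The data corresponding to FP_1 and FP_4 is deduplicated. The data corresponding to FP_1 and FP_4 in the fingerprint record is already duplicated within the record itself. Therefore, when the fingerprint table is queried with FP_1 and FP_4, the deduplication operation can still be performed on that data even if no match is found in the fingerprint table, which can improve deduplication efficiency. In a specific implementation, if fingerprint FP_1 is found in the fingerprint table, the storage system already stores the unique data corresponding to FP_1, and the fingerprint table records a mapping between FP_1 and the storage address of that unique data. The data corresponding to FP_1 in the fingerprint record therefore no longer needs to be stored; it is only necessary to establish a mapping between the host access address of that data and fingerprint FP_1 in the fingerprint table. If fingerprint FP_4 is not found in the fingerprint table, the storage system does not store unique data corresponding to FP_4. In that case, one fingerprint record item is selected from those containing FP_4, the data at the storage address corresponding to FP_4 in that item is read and stored in the deduplication region to obtain a new storage address, and a mapping between FP_4 and the new storage address is established in the fingerprint table.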
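The two branches of S204 (fingerprint found or not found in the fingerprint table) can be sketched as below. All of the data structures are assumptions: plain dictionaries stand in for the storage device, the deduplication region, the fingerprint table, and the host address mapping, and the storage address of each item doubles as its host access address for simplicity.

```python
def deduplicate(fp, addresses, fingerprint_table, storage, dedup_region, host_map):
    """Deduplicate all data whose record items carry fingerprint `fp`.

    addresses: storage addresses taken from the record items containing `fp`.
    """
    if fp not in fingerprint_table:
        # No unique copy exists yet: move one copy into the deduplication
        # region and register its new address in the fingerprint table.
        data = storage[addresses[0]]
        new_address = f"dedup_{len(dedup_region)}"
        dedup_region[new_address] = data
        fingerprint_table[fp] = new_address
    for address in addresses:
        host_map[address] = fp      # host address now resolves via the fingerprint
        storage.pop(address, None)  # the redundant copy can be released


storage = {"a1": b"x", "a2": b"x", "a3": b"x", "b1": b"y"}
dedup_region = {"dedup_0": b"y"}
fingerprint_table = {"FP_1": "dedup_0"}   # FP_1 already has a unique copy
host_map = {}

deduplicate("FP_1", ["b1"], fingerprint_table, storage, dedup_region, host_map)
deduplicate("FP_4", ["a1", "a2", "a3"], fingerprint_table, storage, dedup_region, host_map)
```

After both calls, one copy of each distinct piece of data remains in the deduplication region and every original address maps to its fingerprint.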
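S205: The storage system deletes the fingerprint record items that include the duplicate fingerprints from the fingerprint record.
As an example, the fingerprint record items that include duplicate fingerprints are deleted from the fingerprint record. For example, after the fingerprint record items containing FP_1 and FP_4 are deleted, the fingerprint record shown in FIG. 3(c) is obtained. The other fingerprint record items are kept in the fingerprint record because their fingerprints are not duplicate fingerprints.
Deduplicating the data whose fingerprints reach the repetition threshold in the fingerprint record improves the deduplication rate of the storage system. However, if the fingerprint record items containing the fingerprints of this data are deleted after deduplication, then when data with these fingerprints is written into the storage system again, the fingerprint record no longer contains items with these fingerprints, so the newly written data cannot be deduplicated because the repetition count of its fingerprint does not reach the threshold. To solve this problem, the embodiments of this application further include:
S206: The storage system records, in the fingerprint record, stubs of the fingerprints in the deleted fingerprint record items.
In the embodiments of this application, the stub of a fingerprint in a deleted fingerprint record item is used to indicate that the fingerprint in the deleted fingerprint record item is a duplicate fingerprint.
Specifically, in the fingerprint record shown in FIG. 4(a), there are 3 duplicate fingerprints: FP_1, FP_4, and FP_9. Stubs corresponding to these 3 duplicate fingerprints, that is, the stub of FP_1, the stub of FP_4, and the stub of FP_9, are added to the fingerprint record, giving the fingerprint record shown in FIG. 4(b). In FIG. 4(b), the stub of each fingerprint can be a record item, and changing the information in the token to stub indicates that the record item is a fingerprint stub. The token of the record item corresponding to FP_1 can be marked stub_1, the token of the record item corresponding to FP_4 can be marked stub_2, and the token of the record item corresponding to FP_9 can be marked stub_3.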
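Steps S205 and S206 together amount to replacing every group of duplicate items with a single stub item. The sketch below assumes a list of `(fingerprint, token)` tuples and numbers the stubs in sorted fingerprint order so that the result lines up with the stub_1 to stub_3 labels in the example; the function name and the sample record are hypothetical.

```python
def replace_with_stubs(record, duplicate_fps):
    """Delete items with duplicate fingerprints (S205) and append one
    stub item per duplicate fingerprint (S206)."""
    kept = [(fp, token) for fp, token in record if fp not in duplicate_fps]
    stubs = [
        (fp, f"stub_{index}")
        for index, fp in enumerate(sorted(duplicate_fps), start=1)
    ]
    return kept + stubs


record = [
    ("FP_0", "token_1"),
    ("FP_1", "token_2"),
    ("FP_1", "token_3"),
    ("FP_4", "token_4"),
    ("FP_9", "token_5"),
    ("FP_6", "token_6"),
]
new_record = replace_with_stubs(record, {"FP_1", "FP_4", "FP_9"})
```

Each stub occupies a single item regardless of how many duplicate items it replaces, which is why the record shrinks while the knowledge that a fingerprint is a duplicate is retained.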
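S207: The storage system records a new fingerprint record item in the fingerprint record.
In the embodiments of this application, the new fingerprint record item contains fingerprint FP_1 and a new storage address of the data corresponding to FP_1, and the data corresponding to FP_1 in the new fingerprint record item is newly written data.
The storage system receives new data, calculates the fingerprint of the new data, stores the new data, and generates a fingerprint record item corresponding to the new data.
S208: The storage system determines, according to the stub of the fingerprint in the deleted fingerprint record items, that the fingerprint in the new fingerprint record item is a duplicate fingerprint.
After the new fingerprint record item is recorded in the fingerprint record, the new item is compared with the stubs in the fingerprint record to determine whether its fingerprint is the same as the fingerprint corresponding to a stub. If they are the same, the fingerprint in the new fingerprint record item is determined to be a duplicate fingerprint; otherwise, the fingerprint is not a duplicate fingerprint, and the deduplication operation is performed only after the number of repetitions of the fingerprint reaches the threshold.
As an example, in the fingerprint record of FIG. 4(c), the fingerprint of the new data recorded in the new fingerprint record item is FP_1, which is the same as the fingerprint corresponding to the stub of FP_1. Therefore, the fingerprint of the new data is a duplicate fingerprint.
In this way, after new data is stored in the storage system, the fingerprint corresponding to the new data can be compared with the stub. If it is the same as the fingerprint indicated by the stub, the new data can be deduplicated without waiting for the number of repetitions of its fingerprint to reach the threshold, which can improve the efficiency of deduplication technology.
S209: The storage system performs a deduplication operation on the newly written data.
If the fingerprint of the new data is a duplicate fingerprint, the data has already been stored in the storage device, so the new data can be deduplicated directly.
Note that when the newly written service data is deduplicated, because the fingerprint table already stores the fingerprint, the fingerprint table does not need to be queried; a mapping between the host access address of the new data and fingerprint FP_1 can be established directly, which can reduce deduplication latency.
S210: The storage system deletes the new fingerprint record item.
After the newly written service data is deduplicated, the new fingerprint record item corresponding to the new data is deleted from the fingerprint record, giving the fingerprint record shown in FIG. 4(d). Deleting invalid fingerprint record items reduces the storage space occupied by the fingerprint record and improves storage space utilization.
Note that the new data may also be different from all data already stored in the storage system. For example, the new data also includes data 23, the fingerprint of data 23 is calculated as FP_8, and the token corresponding to service data 23 is token_23, giving the fingerprint record shown in FIG. 5(a). Because the fingerprint record item corresponding to service data 23 includes FP_8, which is different from the fingerprint in every other fingerprint record item in the fingerprint record, the fingerprint in that item is not a duplicate fingerprint. Therefore, no deduplication operation is performed on service data 23, and the fingerprint record item corresponding to service data 23 is not deleted, giving the fingerprint record shown in FIG. 5(b).
S211: When the storage space occupied by the fingerprint record is greater than or equal to a first threshold, the storage system deletes some fingerprint record items.
In the embodiments of this application, the fingerprint record is stored in the deduplication metadata space. Because this space is limited, as more and more data is written into the storage system, the storage space occupied by the fingerprint record may exceed the first threshold. The first threshold may be, for example, 80% or 70% of the maximum size of the deduplication metadata space. If the storage space occupied by the fingerprint record exceeds the first threshold (see FIG. 6(a)), some fingerprint record items in the fingerprint record need to be deleted, which can also be understood as evicting some fingerprint record items. Note that evicting or deleting fingerprint record items means that only the fingerprint record items are processed; the data corresponding to them is not.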
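In the embodiments of this application, deleting some of the fingerprint record items may include, but is not limited to, the following methods.
Method 1:
A third fingerprint record item is deleted, where the fingerprint included in the third fingerprint record item is different from the fingerprints included in all other fingerprint record items in the fingerprint record. In other words, singleton fingerprint record items are deleted from the fingerprint record.
It takes a certain period of time for the storage space occupied by the fingerprint record to reach the first threshold. If a fingerprint record item remains a singleton throughout this period, the probability that the data corresponding to its fingerprint is stored repeatedly is very small, and the item would have to wait even longer before a deduplication operation could be performed on it. The item can therefore be deleted directly to save the storage space occupied by the fingerprint record.
As an example, in FIG. 6(a), the fingerprint record items corresponding to FP_0, FP_6, FP_7, FP_8, FP_10, and so on all contain singleton fingerprints, so these fingerprint record items can be deleted, giving the fingerprint record shown in FIG. 6(b).
Method 2:
A fourth fingerprint record item is deleted, where the length of time the fourth fingerprint record item has been stored in the fingerprint record is greater than or equal to a second threshold. In other words, the fingerprint record items that were written earliest are deleted.
The earlier a fingerprint record item was written into the fingerprint record, the more likely it is that the data corresponding to it has been overwritten by new data. Data that has been overwritten is no longer stored repeatedly in the storage system, so there is no need to deduplicate it. The earlier-written fingerprint record items can therefore be deleted to save the storage space occupied by the fingerprint record.
As an example, if the storage system writes data sequentially, a smaller storage address means that the data has been stored in the storage system longer, and therefore that the corresponding fingerprint record item has been kept in the fingerprint record longer. The length of time a fingerprint record item has been stored in the fingerprint record can thus be determined from the value of its token. The second threshold may be expressed as a difference from the maximum token value among the fingerprint record items; the difference may be 20, 15, or the like. Taking a difference of 20 as an example: in FIG. 7(a), the maximum token value is 31, so the fingerprint record items whose token values are 1 to 11 are deleted, giving the fingerprint record shown in FIG. 7(b).
Method 3:
A fifth fingerprint record item in the fingerprint record is deleted, where the fingerprint record has not recorded a predetermined number of fifth fingerprint record items within a predetermined time. In other words, fingerprint record items that appear infrequently in the fingerprint record are deleted.
If a fingerprint record item appears only a few times within the predetermined time, the probability that the data corresponding to its fingerprint is stored repeatedly is low, so the item can be deleted directly to save the storage space occupied by the fingerprint record.
As an example, the predetermined number may be 1 (or 2); that is, the fingerprint record items corresponding to fingerprints that appear no more than once (or twice) in the fingerprint record are deleted. When the predetermined number is 1, the result is the same as in Method 1. When the predetermined number is 2, the process is analogous to Method 1 and is not repeated here.
Method 4:
If the fingerprint record contains a stub of a second fingerprint, where the stub indicates that the second fingerprint is a duplicate fingerprint, it is determined whether the fingerprint record has recorded a predetermined number of third fingerprint record items containing the second fingerprint within a predetermined time. If the fingerprint record has not recorded the predetermined number of third fingerprint record items within the predetermined time, the stub of the second fingerprint is deleted from the fingerprint record.
If few fingerprint record items with a given fingerprint are recorded after the stub of that fingerprint is recorded, the stub has been used to identify few duplicate fingerprints. In other words, the stub contributes little to identifying duplicate fingerprints, so it can be deleted to save the storage space occupied by the fingerprint record.
As an example, the record item corresponding to a stub can record how many fingerprint record items containing the stub's fingerprint were recorded within a preset time. For example, a number parameter is added to the token, and its value is the number of such items recorded within the preset time. A number value of 3 means that fingerprint record items containing the stub's fingerprint were recorded 3 times within the preset time; see FIG. 8(a). Note that the value of the number parameter is reset to zero at the end of every preset time period, which can be 5 s or 10 s. If the predetermined number is 3, a stub record item whose number value is less than 3 can be deleted, giving the fingerprint record shown in FIG. 8(b).
As another example, the record item corresponding to a stub can record the last time the stub was used to identify a duplicate fingerprint. For example, it can record in which sorting pass the stub identified a duplicate fingerprint, marked as sorted; see FIG. 9(a). Here, sorted_1 means that the stub identified a duplicate fingerprint in the previous sorting pass, sorted_2 means that it did so two sorting passes before the current one, and so on. A stub record item whose sorted value is greater than 2 can be deleted, giving the fingerprint record shown in FIG. 9(b).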
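The sorted-based rule can be sketched as an age counter per stub that is reset whenever the stub identifies a duplicate. The dictionary layout and the name `age_and_evict_stubs` are assumptions for illustration; the limit of 2 follows the example in the text.

```python
def age_and_evict_stubs(stub_age, used_this_pass, max_age=2):
    """Update the `sorted` value of each stub after a sorting pass.

    stub_age: fingerprint -> passes since the stub last identified a
              duplicate (1 means it was used in the most recent pass).
    used_this_pass: fingerprints whose stub matched in the pass just finished.
    Stubs whose value would exceed `max_age` are dropped.
    """
    survivors = {}
    for fp, age in stub_age.items():
        new_age = 1 if fp in used_this_pass else age + 1
        if new_age <= max_age:
            survivors[fp] = new_age
    return survivors


stub_age = {"FP_1": 1, "FP_4": 2, "FP_9": 1}
after_pass = age_and_evict_stubs(stub_age, used_this_pass={"FP_9"})
```

In this run FP_9 is refreshed to 1, FP_1 ages to 2 and is kept, and FP_4 would reach 3 and is therefore removed.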
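Method 5:
Any two or more of Methods 1 to 4 can be combined.
As an example, Method 1 and Method 2 are combined. In FIG. 10(a), the fingerprint record items corresponding to FP_0, FP_6, FP_7, FP_8, FP_10, FP_13, and FP_16 to FP_18 all contain singleton fingerprints. However, the fingerprint record items corresponding to FP_10, FP_13, and FP_16 to FP_18 have been stored for only a short time (because they were written later). Therefore, only the fingerprint record items corresponding to FP_0, FP_6, FP_7, and FP_8 are deleted, and those corresponding to FP_10, FP_13, and FP_16 to FP_18 are retained, giving the fingerprint record shown in FIG. 10(b).
In addition, in the embodiments of this application, after determining that the storage space occupied by the fingerprint record is greater than or equal to the first threshold, the storage server may first determine how many fingerprint record items need to be deleted and then delete that number of items from the fingerprint record. For example, if every fingerprint record item occupies the same amount of space, the storage server can determine the maximum number of items the fingerprint record can hold, such as 30. When the number of fingerprint record items reaches 33, it can be determined that 3 items need to be deleted, and the 3 items are selected according to any one of the five methods above. Once 3 items that meet the condition have been found, they can be deleted without traversing the entire fingerprint record, which can improve efficiency.
Note that fingerprint record items to be deleted can also be determined in other ways, which are not listed one by one here.
In the foregoing technical solution, because a stub corresponding to each duplicate fingerprint is added to the fingerprint record, the stub can be used to determine directly that the fingerprint in a fingerprint record item is a duplicate fingerprint, instead of waiting, as in the prior art, until the fingerprint has been repeated a certain number of times. Duplicate fingerprints can thus be identified quickly and the corresponding data deduplicated, which can improve the efficiency of deduplication technology.
In addition, note that the embodiment shown in FIG. 2 is executed by the server of the distributed storage system in the distributed storage system scenario, and by the array controller of the storage array in the storage array scenario.
In the embodiments provided above, to implement the functions in the methods provided in the embodiments of this application, the storage system may include a hardware structure and/or a software module, and may implement the functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether a given function is executed by a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.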
图11示出了一种重复数据的删除装置1100的结构示意图。其中,重复数据的删除装置1100可以用于实现分布式存储系统的服务端的功能,也可以用于实现存储阵列中的阵列 控制器的功能。重复数据的删除装置1100可以是硬件结构、软件模块、或硬件结构加软件模块。重复数据的删除装置1100可以由芯片系统实现。本申请实施例中,芯片系统可以由芯片构成,也可以包含芯片和其他分立器件。
重复数据的删除装置1100可以包括处理模块1101和通信模块1102。
处理模块1101可以用于执行图2所示的实施例中的步骤S201~步骤S211,和/或用于支持本文所描述的技术的其它过程。
通信模块1102可以用于支持图2所示的实施例中通信系统获取数据,和/或用于支持本文所描述的技术的其它过程。通信模块1102用于重复数据的删除装置1100和其它模块进行通信,其可以是电路、器件、接口、总线、软件模块、收发器或者其它任意可以实现通信的装置。
其中,上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能模块的功能描述,在此不再赘述。
图11所示的实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,另外,在本申请各个实施例中的各功能模块可以集成在一个处理器中,也可以是单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
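FIG. 12 shows a deduplication apparatus 1200 provided in an embodiment of this application. The deduplication apparatus 1200 may be configured to implement the functions of the server of a distributed storage system, or the functions of the array controller in a storage array. The deduplication apparatus 1200 may be a chip system. In the embodiments of this application, a chip system may consist of a chip, or may include a chip and other discrete components.
The deduplication apparatus 1200 includes at least one processor 1220, configured to implement, or to support the deduplication apparatus 1200 in implementing, the functions of the storage server in the method provided in the embodiments of this application. For example, the processor 1220 may perform a deduplication operation on newly written data. For details, see the description in the method example; they are not repeated here.
The deduplication apparatus 1200 may further include at least one memory 1230, configured to store program instructions and/or data. The memory 1230 is coupled to the processor 1220. Coupling in the embodiments of this application is an indirect coupling or communication connection between apparatuses, units, or modules; it may be electrical, mechanical, or of another form, and is used for information exchange between the apparatuses, units, or modules. The processor 1220 may operate in cooperation with the memory 1230. The processor 1220 may execute the program instructions stored in the memory 1230. At least one of the at least one memory may be included in the processor.
The deduplication apparatus 1200 may further include a communication interface 1210, configured to communicate with other devices over a transmission medium, so that the deduplication apparatus 1200 can communicate with other devices. For example, the other device may be a storage client or a storage device. The processor 1220 may send and receive data through the communication interface 1210.
The embodiments of this application do not limit the specific connection medium between the communication interface 1210, the processor 1220, and the memory 1230. In FIG. 12, the memory 1230, the processor 1220, and the communication interface 1210 are connected by a bus 1240, which is drawn as a thick line; the way other components are connected is shown only schematically and is not limiting. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 12, but this does not mean that there is only one bus or only one type of bus.
In the embodiments of this application, the processor 1220 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
In the embodiments of this application, the memory 1230 may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory may also be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory in the embodiments of this application may also be a circuit or any other apparatus capable of implementing a storage function, configured to store program instructions and/or data.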
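An embodiment of this application further provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform the method performed by the storage server in the embodiment shown in FIG. 2.
An embodiment of this application further provides a computer program product including instructions that, when run on a computer, cause the computer to perform the method performed by the storage server in the embodiment shown in FIG. 2.
An embodiment of this application provides a chip system. The chip system includes a processor and may further include a memory, and is configured to implement the functions of the storage server in the foregoing method. The chip system may consist of a chip, or may include a chip and other discrete components.
An embodiment of this application provides a storage system. The storage system includes a storage device and the storage server in the embodiment shown in FIG. 2.
The methods provided in the embodiments of this application may be implemented entirely or partly by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, entirely or partly, of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are generated entirely or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, an SSD).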

Claims (20)

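  1. A method for deleting duplicate data, comprising:
    obtaining a fingerprint record, wherein the fingerprint record contains a plurality of fingerprint record entries, and each fingerprint record entry contains a fingerprint;
    determining at least two first fingerprint record entries from the fingerprint record, wherein each first fingerprint record entry contains a first fingerprint and a storage address of data corresponding to the first fingerprint, and the storage addresses of the data corresponding to the first fingerprint in the at least two first fingerprint record entries are all different;
    performing a deduplication operation on the data corresponding to the first fingerprint in the at least two first fingerprint record entries;
    deleting the at least two first fingerprint record entries; and
    recording a stub of the first fingerprint in the fingerprint record, wherein the stub of the first fingerprint indicates that the first fingerprint is a duplicate fingerprint.
  2. The method according to claim 1, further comprising:
    recording a second fingerprint record entry in the fingerprint record, wherein the second fingerprint record entry contains the first fingerprint and a new storage address of data corresponding to the first fingerprint, and the data corresponding to the first fingerprint in the second fingerprint record entry is newly written data;
    determining, based on the stub of the first fingerprint, that the first fingerprint in the second fingerprint record entry is a duplicate fingerprint; and
    performing a deduplication operation on the newly written data.
  3. The method according to claim 2, further comprising:
    deleting the second fingerprint record entry.
  4. The method according to any one of claims 1 to 3, further comprising:
    when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, deleting a third fingerprint record entry, wherein the fingerprint included in the third fingerprint record entry is different from the fingerprints included in all other fingerprint record entries in the fingerprint record.
  5. The method according to any one of claims 1 to 4, further comprising:
    when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, deleting a fourth fingerprint record entry, wherein the length of time for which the fourth fingerprint record entry has been stored in the fingerprint record is greater than or equal to a second threshold.
  6. The method according to any one of claims 1 to 5, further comprising:
    when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, determining whether a predetermined quantity of third fingerprint record entries have been recorded in the fingerprint record within a predetermined time; and
    when the predetermined quantity of third fingerprint record entries have not been recorded in the fingerprint record within the predetermined time, deleting a stub of a second fingerprint from the fingerprint record, wherein the stub of the second fingerprint indicates that the second fingerprint is a duplicate fingerprint, and the third fingerprint record entry contains the second fingerprint.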
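  7. An apparatus for deleting duplicate data, comprising a communication interface and a processor, wherein:
    the communication interface is configured to obtain a fingerprint record, wherein the fingerprint record contains a plurality of fingerprint record entries, and each fingerprint record entry contains a fingerprint; and
    the processor is configured to: determine at least two first fingerprint record entries from the fingerprint record, wherein each first fingerprint record entry contains a first fingerprint and a storage address of data corresponding to the first fingerprint, and the storage addresses of the data corresponding to the first fingerprint in the at least two first fingerprint record entries are all different;
    perform a deduplication operation on the data corresponding to the first fingerprint in the at least two first fingerprint record entries;
    delete the at least two first fingerprint record entries; and
    record a stub of the first fingerprint in the fingerprint record, wherein the stub of the first fingerprint indicates that the first fingerprint is a duplicate fingerprint.
  8. The apparatus according to claim 7, wherein the processor is further configured to:
    record a second fingerprint record entry in the fingerprint record, wherein the second fingerprint record entry contains the first fingerprint and a new storage address of data corresponding to the first fingerprint, and the data corresponding to the first fingerprint in the second fingerprint record entry is newly written data;
    determine, based on the stub of the first fingerprint, that the first fingerprint in the second fingerprint record entry is a duplicate fingerprint; and
    perform a deduplication operation on the newly written data.
  9. The apparatus according to claim 8, wherein the processor is further configured to:
    delete the second fingerprint record entry.
  10. The apparatus according to any one of claims 7 to 9, wherein the processor is further configured to:
    when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, delete a third fingerprint record entry, wherein the fingerprint included in the third fingerprint record entry is different from the fingerprints included in all other fingerprint record entries in the fingerprint record.
  11. The apparatus according to any one of claims 7 to 10, wherein the processor is further configured to:
    when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, delete a fourth fingerprint record entry, wherein the length of time for which the fourth fingerprint record entry has been stored in the fingerprint record is greater than or equal to a second threshold.
  12. The apparatus according to any one of claims 7 to 11, wherein the processor is further configured to:
    when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, determine whether a predetermined quantity of third fingerprint record entries have been recorded in the fingerprint record within a predetermined time; and
    when the predetermined quantity of third fingerprint record entries have not been recorded in the fingerprint record within the predetermined time, delete a stub of a second fingerprint from the fingerprint record, wherein the stub of the second fingerprint indicates that the second fingerprint is a duplicate fingerprint, and the third fingerprint record entry contains the second fingerprint.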
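  13. An apparatus for deleting duplicate data, comprising:
    a communication module, configured to obtain a fingerprint record, wherein the fingerprint record contains a plurality of fingerprint record entries, and each fingerprint record entry contains a fingerprint; and
    a processing module, configured to: determine at least two first fingerprint record entries from the fingerprint record, wherein each first fingerprint record entry contains a first fingerprint and a storage address of data corresponding to the first fingerprint, and the storage addresses of the data corresponding to the first fingerprint in the at least two first fingerprint record entries are all different;
    perform a deduplication operation on the data corresponding to the first fingerprint in the at least two first fingerprint record entries;
    delete the at least two first fingerprint record entries; and
    record a stub of the first fingerprint in the fingerprint record, wherein the stub of the first fingerprint indicates that the first fingerprint is a duplicate fingerprint.
  14. The apparatus according to claim 13, wherein the processing module is further configured to:
    record a second fingerprint record entry in the fingerprint record, wherein the second fingerprint record entry contains the first fingerprint and a new storage address of data corresponding to the first fingerprint, and the data corresponding to the first fingerprint in the second fingerprint record entry is newly written data;
    determine, based on the stub of the first fingerprint, that the first fingerprint in the second fingerprint record entry is a duplicate fingerprint; and
    perform a deduplication operation on the newly written data.
  15. The apparatus according to claim 14, wherein the processing module is further configured to:
    delete the second fingerprint record entry.
  16. The apparatus according to any one of claims 13 to 15, wherein the processing module is further configured to:
    when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, delete a third fingerprint record entry, wherein the fingerprint included in the third fingerprint record entry is different from the fingerprints included in all other fingerprint record entries in the fingerprint record.
  17. The apparatus according to any one of claims 13 to 16, wherein the processing module is further configured to:
    when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, delete a fourth fingerprint record entry, wherein the length of time for which the fourth fingerprint record entry has been stored in the fingerprint record is greater than or equal to a second threshold.
  18. The apparatus according to any one of claims 13 to 17, wherein the processing module is further configured to:
    when the storage space occupied by the fingerprint record is greater than or equal to a first threshold, determine whether a predetermined quantity of third fingerprint record entries have been recorded in the fingerprint record within a predetermined time; and
    when the predetermined quantity of third fingerprint record entries have not been recorded in the fingerprint record within the predetermined time, delete a stub of a second fingerprint from the fingerprint record, wherein the stub of the second fingerprint indicates that the second fingerprint is a duplicate fingerprint, and the third fingerprint record entry contains the second fingerprint.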
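  19. A computer storage medium, wherein the computer storage medium stores instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 6.
  20. A computer program product, wherein the computer program product stores instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 6.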
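
For illustration only, and without limiting the claims, the stub-removal condition of claims 6, 12, and 18 might be sketched as follows. The mapping `stub_hits`, the function name `expire_stubs`, and the parameters `now`, `window`, and `min_hits` are assumptions made for this sketch; the caller is assumed to have already checked that the fingerprint record has reached the first threshold.

```python
def expire_stubs(stub_hits, now, window, min_hits):
    """Return (kept, expired) sets of stub fingerprints.

    stub_hits maps each stub's fingerprint to the timestamps at which a new
    entry containing that fingerprint was recorded (assumed bookkeeping).
    A stub is dropped when fewer than `min_hits` such entries were recorded
    within the last `window` time units.
    """
    kept, expired = set(), set()
    for fp, timestamps in stub_hits.items():
        # Count only the entries recorded inside the predetermined time.
        recent = sum(1 for t in timestamps if now - t <= window)
        (kept if recent >= min_hits else expired).add(fp)
    return kept, expired
```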
PCT/CN2020/104846 2019-08-14 2020-07-27 Deduplication method and apparatus WO2021027541A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20852903.2A EP4016276A4 (en) 2019-08-14 2020-07-27 DATA DEDUPLICATION METHOD AND APPARATUS
US17/671,224 US20220164316A1 (en) 2019-08-14 2022-02-14 Deduplication method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910748958.1 2019-08-14
CN201910748958.1A CN110618789B (zh) 2019-08-14 2019-08-14 Deduplication method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/671,224 Continuation US20220164316A1 (en) 2019-08-14 2022-02-14 Deduplication method and apparatus

Publications (1)

Publication Number Publication Date
WO2021027541A1 (zh) 2021-02-18

Family

ID=68921113

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/104846 WO2021027541A1 (zh) 2019-08-14 2020-07-27 Deduplication method and apparatus

Country Status (4)

Country Link
US (1) US20220164316A1 (zh)
EP (1) EP4016276A4 (zh)
CN (1) CN110618789B (zh)
WO (1) WO2021027541A1 (zh)

Also Published As

Publication number Publication date
CN110618789B (zh) 2021-08-20
EP4016276A4 (en) 2022-10-26
US20220164316A1 (en) 2022-05-26
EP4016276A1 (en) 2022-06-22
CN110618789A (zh) 2019-12-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 20852903; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2020852903; Country of ref document: EP; Effective date: 20220314