CN107122130B - Data deduplication method and device - Google Patents


Info

Publication number
CN107122130B
Authority
CN
China
Prior art keywords
deduplication
logical address
data
preset
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710239910.9A
Other languages
Chinese (zh)
Other versions
CN107122130A (en
Inventor
扈海龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Macrosan Technologies Co Ltd
Original Assignee
Macrosan Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Macrosan Technologies Co Ltd filed Critical Macrosan Technologies Co Ltd
Priority to CN201710239910.9A priority Critical patent/CN107122130B/en
Publication of CN107122130A publication Critical patent/CN107122130A/en
Application granted granted Critical
Publication of CN107122130B publication Critical patent/CN107122130B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604Improving or facilitating administration, e.g. storage management
    • G06F3/0607Improving or facilitating administration, e.g. storage management by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0679Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data deduplication method and a data deduplication device, wherein the method is applied to a storage device and can comprise the following steps: sequentially reading logical address mapping records from the logical address mapping table; judging whether the value of the preset identification bit in the read logical address mapping record is a first preset value or not; and if the value of the identification bit is a first preset value, performing deduplication processing on written data corresponding to the logical address mapping based on a preset deduplication strategy. Since the written data which is not subjected to the deduplication processing can be determined from the value of the preset identification bit as the first preset value, the deduplication processing can be performed on the written data which is not subjected to the deduplication processing, and therefore the efficiency of the deduplication processing can be improved.

Description

Data deduplication method and device
Technical Field
The application relates to the field of storage, in particular to a data deduplication technology.
Background
The deduplication technology is a storage technology that automatically searches for duplicate data and retains a unique copy of the same data. Through the deduplication processing, redundant data of a storage system can be eliminated, and the requirement on storage capacity is reduced.
The currently popular deduplication technology is online deduplication based on a hash algorithm. For example, the hash value of each data block is calculated, and by matching hash values, new data is retained and duplicate data is deleted. However, calculating and matching hash values consumes considerable system resources. In particular, running deduplication again on data blocks that have already been deduplicated greatly increases the consumption of system resources and reduces the efficiency of deduplication.
Disclosure of Invention
In view of the above, the present application provides a data deduplication method and apparatus, so as to improve deduplication efficiency.
Specifically, the method is realized through the following technical scheme:
according to a first aspect of the present application, a data deduplication method is provided, where the method is applied to a storage device, where the storage device includes a preconfigured logical address mapping table, where the logical address mapping table includes a number of logical address mapping records; the logical address mapping record comprises a mapping relation of a logical address, a physical address and a preset identification bit of written data; wherein, in the logical address mapping record corresponding to the written data which does not complete the deduplication processing, the value of the preset identification bit is a first preset value, and the method includes:
sequentially reading logical address mapping records from the logical address mapping table;
judging whether the value of the preset identification bit in the read logical address mapping record is a first preset value or not;
and if the value of the identification bit is a first preset value, performing deduplication processing on written data corresponding to the logical address mapping based on a preset deduplication strategy.
According to a second aspect of the present application, a data deduplication apparatus is provided, where the apparatus is applied to a storage device, where the storage device includes a preconfigured logical address mapping table, where the logical address mapping table includes a plurality of logical address mapping records; the logical address mapping record comprises a mapping relation of a logical address, a physical address and a preset identification bit of written data; wherein, in the logical address mapping record corresponding to the written data that does not complete the deduplication processing, the value of the preset identification bit is a first preset value, and the apparatus includes:
sequentially reading logical address mapping records from the logical address mapping table;
judging whether the value of the preset identification bit in the read logical address mapping record is a first preset value or not;
and if the value of the identification bit is a first preset value, performing deduplication processing on written data corresponding to the logical address mapping based on a preset deduplication strategy.
The method for deduplication processing is characterized in that a preset identification bit for identifying a deduplication state is added in a logical address mapping record of a logical address mapping table, and after a storage device judges that a value of the preset identification bit in the read logical address mapping record is a first preset value, deduplication processing can be performed on written data corresponding to the logical address mapping based on a preset deduplication strategy.
Because a preset identification bit identifying the deduplication state of the written data is added, the storage device can distinguish the written data that has not been subjected to deduplication processing and perform deduplication processing only on that data, so that the resource consumption of deduplication processing is greatly reduced and the deduplication processing efficiency is improved.
Drawings
Fig. 1 is a schematic diagram illustrating a related data deduplication technology according to an exemplary embodiment of the present application;
Fig. 2 is a flowchart illustrating a data deduplication method according to an exemplary embodiment of the present application;
Fig. 3 is a schematic diagram illustrating a data deduplication method according to an exemplary embodiment of the present application;
Fig. 4 is a hardware structure diagram of a device in which a data deduplication apparatus according to an exemplary embodiment of the present application is located;
Fig. 5 is a block diagram illustrating a data deduplication apparatus according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
The deduplication technology is a storage technology that automatically searches for duplicate data and retains a unique copy of the same data. Through the deduplication processing, redundant data of a storage system can be eliminated, and the requirement on storage capacity is reduced.
In general, the deduplication techniques may include online deduplication and background deduplication.
The online deduplication refers to that after the storage device receives the write IO request, the storage device may perform deduplication processing on target data in the write IO request, and then determine whether to write the target data based on a status after the deduplication processing.
The background deduplication refers to that the storage device reads locally written target data and then performs deduplication processing on the read target data.
It should be noted that the online deduplication and the background deduplication have the same deduplication processing process, but the online deduplication mainly performs deduplication processing on data to be written in the write IO request, and the background deduplication performs deduplication processing on the written data corresponding to the logical address mapping record in the logical address mapping table.
For better understanding of the present application, the following describes the flow of data deduplication processing in the related art in detail.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a deduplication technique in a related data deduplication technology according to an exemplary embodiment of the present application.
In Fig. 1, (a1) is a logical address mapping table. The logical address mapping table includes a plurality of logical address mapping records, each recording the logical address and the physical address of written data.
In Fig. 1, (b1) is a deduplication mapping table. The deduplication mapping table includes a plurality of deduplication mapping records, each recording the hash value and the physical address of written data that has completed deduplication processing.
During deduplication, the storage device may calculate a hash value (e.g., H_x) of the target data (e.g., Data_0), look up the deduplication mapping record matching the calculated hash value in the deduplication mapping table, and determine from that record the physical address (e.g., B_x1) of the written data that has completed deduplication processing and corresponds to the hash value.
The storage device may then read, from physical address B_x1, the written data that has completed deduplication processing (e.g., Data_0 shown in (b2) of Fig. 1). After determining that the read data is identical to the target data, the storage device modifies the physical address of the target data (e.g., B_x) in the logical address mapping record corresponding to the target data to the physical address (e.g., B_x1) of the written data that has completed deduplication processing.
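The related-art flow above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the dictionaries standing in for the two mapping tables and the block store, the function name `related_art_dedup`, and the choice of SHA-256 are all assumptions made for the example. Registering the first copy when no hash matches is also an assumption about how the table gets populated.

```python
import hashlib

def related_art_dedup(logical_addr, l2p, dedup_table, blocks):
    """Deduplicate one written block; return True if its mapping was redirected.

    l2p:         logical address -> physical address (Fig. 1 (a1), hypothetical dict)
    dedup_table: hash value -> physical address      (Fig. 1 (b1), hypothetical dict)
    blocks:      physical address -> stored bytes    (stand-in for the disk)
    """
    phys = l2p[logical_addr]
    data = blocks[phys]
    digest = hashlib.sha256(data).hexdigest()  # hash algorithm is an assumption
    kept = dedup_table.get(digest)
    if kept is None:
        dedup_table[digest] = phys  # assumption: first copy becomes the retained copy
        return False
    # A byte-for-byte comparison guards against hash collisions.
    if kept != phys and blocks[kept] == data:
        l2p[logical_addr] = kept    # e.g. B_x is replaced by B_x1
        return True
    return False

# Example mirroring Fig. 1: Data_0 is stored at both B_x and B_x1.
blocks = {"B_x1": b"Data_0", "B_x": b"Data_0"}
l2p = {"Addr_0": "B_x"}
dedup_table = {hashlib.sha256(b"Data_0").hexdigest(): "B_x1"}
remapped = related_art_dedup("Addr_0", l2p, dedup_table, blocks)
```

Note that nothing in this flow records whether a block has already been processed, which is the gap the identification bit described below is meant to close.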
However, in the related data deduplication technology, it is difficult to determine whether the target data has been subjected to online or background deduplication processing, so that the target data subjected to deduplication processing is subjected to background deduplication processing, which greatly occupies resources of the device and reduces efficiency of background deduplication.
In addition, the related deduplication technology lacks a mechanism for automatically controlling start-stop and switching of online deduplication or background deduplication, so that the flexibility of deduplication scheduling is greatly reduced.
The method for deduplication processing is characterized in that a preset identification bit for identifying a deduplication state is added in a logical address mapping record of a logical address mapping table, and after a storage device judges that a value of the preset identification bit in the read logical address mapping record is a first preset value, deduplication processing can be performed on written data corresponding to the logical address mapping based on a preset deduplication strategy.
On one hand, because the preset identification bits for identifying the deduplication state of the target data are added, the storage device can determine the target data which is not subjected to deduplication processing, and then perform deduplication processing on the target data, so that the resource consumption of deduplication processing is greatly reduced, and the efficiency of deduplication processing is improved.
On the other hand, the storage device may dynamically schedule background deduplication and online deduplication according to the counted number of logical address mapping records whose preset identification bit takes the first preset value, the occupancy rate of the storage space of the storage device, or the current IO performance state of the storage device as determined from the current IO performance indexes.
Referring to fig. 2, fig. 2 is a flowchart illustrating a data deduplication method according to an exemplary embodiment of the present application. The data deduplication method can be applied to a storage device and can include the following steps.
Step 201: sequentially read logical address mapping records from the logical address mapping table.
Step 202: judge whether the value of the preset identification bit in the read logical address mapping record is a first preset value.
Step 203: if the value of the identification bit is the first preset value, perform deduplication processing on the written data corresponding to the logical address mapping record based on a preset deduplication strategy.
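Steps 201 to 203 amount to a filtered scan over the mapping table. The sketch below is illustrative only: the `MappingRecord` class, the `scan_mapping_table` function, and the callback `dedup_one` are hypothetical names, and the flag values 0 and 1 are the example values the description gives for the first and second preset values.

```python
from dataclasses import dataclass

NOT_DEDUPED = 0  # first preset value (example value)
DEDUPED = 1      # second preset value (example value)

@dataclass
class MappingRecord:
    logical_addr: str
    physical_addr: str
    flag: int

def scan_mapping_table(table, dedup_one):
    """Visit records in order and deduplicate only those still flagged."""
    processed = 0
    for record in table:                # step 201: read records sequentially
        if record.flag == NOT_DEDUPED:  # step 202: check the identification bit
            dedup_one(record)           # step 203: apply the preset strategy
            processed += 1
    return processed

table = [
    MappingRecord("Addr_0", "B_x1", DEDUPED),
    MappingRecord("Addr_1", "B_y", NOT_DEDUPED),
]
visited = []
count = scan_mapping_table(table, lambda r: visited.append(r.logical_addr))
```

Only `Addr_1` reaches the callback here; the record already marked as deduplicated costs a single flag comparison instead of a hash calculation.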
The storage device stores a logical address mapping table and a deduplication mapping table.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating a data deduplication method according to an exemplary embodiment of the present application.
Fig. 3(a1) is a logical address mapping table proposed in the present application, which contains several logical address mapping records. The logical address mapping record includes a mapping relationship between a logical address of written data, a physical address of the written data, and a preset identification bit.
The preset identification bit identifies the deduplication processing state of the written data: different values of the bit correspond to different deduplication processing states. For example, for data that has not been subjected to deduplication processing, the value of the identification bit is a first preset value, which may be 0. For data that has been subjected to deduplication processing, the value of the identification bit may be a second preset value, for example 1. Data with a hash value conflict (i.e., one hash value corresponds to two different pieces of written data) may be treated as not yet deduplicated, so the value of its identification bit may be the first preset value, for example 0. Alternatively, for data with a hash value conflict, the value of the identification bit may be a third preset value, as long as such data can be identified as data that has not been subjected to deduplication processing.
In general, in the above logical address mapping record, the length of the logical address of the written data is 8 bytes, that is, 64 bits, and the length of the physical address of the written data is also 8 bytes. When designing the logical address mapping record, 1 bit can be divided from the physical address length as the identification bit, and 1 bit can also be divided from the logical address length as the identification bit. Here, the flag bit is set in the logical address mapping record by way of example, and is not particularly limited.
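One way to carve the identification bit out of the 8-byte physical address field is sketched below. The description does not say which bit is used; taking the most significant bit is an assumption made for this example, and `pack` and `unpack` are hypothetical helper names.

```python
FLAG_SHIFT = 63                    # assumption: use the top bit of the 64-bit field
ADDR_MASK = (1 << FLAG_SHIFT) - 1  # the low 63 bits remain for the address

def pack(physical_addr, flag):
    """Combine a 63-bit physical address and a 1-bit flag into one 64-bit word."""
    if not 0 <= physical_addr <= ADDR_MASK:
        raise ValueError("physical address must fit in 63 bits")
    if flag not in (0, 1):
        raise ValueError("flag must be 0 or 1")
    return (flag << FLAG_SHIFT) | physical_addr

def unpack(word):
    """Split a 64-bit word back into (physical address, flag)."""
    return word & ADDR_MASK, word >> FLAG_SHIFT

word = pack(0x1234, 1)
addr, flag = unpack(word)
```

A single bit can only distinguish two states, so a design that uses a separate third preset value for hash conflicts would need to reserve at least two bits.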
It should be noted that, a bit is divided from the logical address mapping table as an identification bit, which, on one hand, can save memory space without causing more memory space consumption;
on the other hand, because the logical address mapping table is expanded by dividing one bit in the logical address mapping table as the identification bit, instead of identifying the deduplication processing state of the written data by adding other tables or other mapping relations, the steps of identifying the deduplication state of the target data to be deduplication processed are effectively reduced, and the deduplication processing efficiency is improved.
More importantly, the logical address mapping table is used as core metadata of the storage device, and address information and the like of a memory where the logical address mapping table is located are generally stored in the Cache, so that the preset identification bit is set on the logical address mapping table, and the rate of the storage device accessing the deduplication processing state of the written data can be effectively improved.
Referring to fig. 3, fig. 3(b1) is a deduplication mapping table. The deduplication mapping table comprises a plurality of deduplication mapping records. The deduplication mapping record includes a mapping relationship between a hash value of the written data for which the deduplication processing has been completed and a physical address of the written data for which the deduplication processing has been completed.
The preset deduplication strategy may include: the storage device may calculate a hash value of the target data to be subjected to deduplication processing, and may match the calculated hash value with the hash values in the deduplication mapping table in sequence.
If the calculated hash value does not match any hash value in the deduplication mapping table, the storage device may set the value of the preset identification bit in the logical address mapping record corresponding to the target data to the second preset value.
In an alternative implementation, for online deduplication, the storage device may allocate storage space for the target data and write the target data. The storage device may generate a logical address mapping record with a logical address as the logical address of the target data, a physical address as the address of the allocated storage space, and a value of the preset identification bit as a second preset value for the target data, and add the logical address mapping record to the logical address mapping table.
In another alternative implementation, for background deduplication, the storage device may add a deduplication mapping record of the hash value of the target data and the physical address of the target data in the deduplication mapping table. Meanwhile, the storage device sets the value of a preset identification bit of a logical address mapping record aiming at the target data in the logical address mapping table to be a second preset value.
If the hash value is matched in the deduplication mapping table, the storage device may read the written data that has been subjected to deduplication processing and corresponds to the physical address for which the hash value has a mapping relationship, and may compare the target data with the read written data that has been subjected to deduplication processing.
If the target data is different from the read written data that has been subjected to deduplication processing, the storage device may set the value of the preset identification bit in the logical address mapping record corresponding to the target data to be the first preset value.
In an optional implementation manner, for online deduplication, the storage device may allocate a storage space for the target data, generate a logical address mapping record whose logical address is a logical address of the target data, whose physical address is an address of the allocated storage space, and whose value of the preset identification bit is a first preset value, and add the logical address mapping record to the logical address mapping table.
In another optional implementation manner, for background deduplication, if the target data is different from the read written data that has completed deduplication, the storage device may set, as a first preset value, a value of a preset identification bit in a logical address mapping record corresponding to the target data in a logical address mapping table.
If the target data is the same as the read written data that has completed deduplication processing, the storage device may set the value of the preset identification bit in the logical address mapping record corresponding to the target data to the second preset value, and set the physical address to the physical address of the written data that has completed deduplication processing.
In an optional implementation manner, for online deduplication, the storage device may discard the target data and, at the same time, generate a logical address mapping record for the target data, where the logical address in the generated record is the logical address of the target data, the physical address is the physical address of the written data that has completed deduplication processing, and the value of the preset identification bit is the second preset value.
In another optional implementation manner, for background deduplication, if the target data is the same as the read written data that has completed deduplication processing, the storage device may modify the physical address in the logical address mapping record corresponding to the target data to the physical address of the written data that has completed deduplication processing, and set the value of the preset identification bit to the second preset value. Meanwhile, the storage device may reclaim the storage space occupied by the target data.
For online deduplication, the target data to be subjected to deduplication processing may be data to be written in an IO write request. For background deduplication, the target data to be deduplicated may be written data corresponding to a logical address mapping record.
In this embodiment, the storage device may generate a logical address mapping record whose preset identification bit takes the first preset value for data on which online or background deduplication processing did not succeed, and for data that was written without online deduplication. During background deduplication, the storage device may perform background deduplication processing, based on the preset deduplication strategy, only on written data whose logical address mapping record has the preset identification bit set to the first preset value. Since data that has already been deduplicated is not processed again in the background, the efficiency of background deduplication can be greatly improved.
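The three outcomes of the preset deduplication strategy for background deduplication can be sketched as below. This is a sketch under stated assumptions, not the claimed implementation: records and tables are plain dictionaries, the function and outcome names are hypothetical, SHA-256 is an assumed hash, and reclaiming is modelled by deleting the block with no reference counting.

```python
import hashlib

NOT_DEDUPED, DEDUPED = 0, 1  # example first and second preset values

def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()

def background_dedup(record, dedup_table, blocks, hash_fn=sha256_hex):
    """Apply the strategy to one flagged record and return the outcome name."""
    data = blocks[record["physical"]]
    digest = hash_fn(data)
    kept = dedup_table.get(digest)
    if kept is None:
        # No matching hash: register this block and mark it as deduplicated.
        dedup_table[digest] = record["physical"]
        record["flag"] = DEDUPED
        return "unique"
    if blocks[kept] != data:
        # Hash conflict: keep the block and leave the flag at the first value.
        record["flag"] = NOT_DEDUPED
        return "collision"
    # True duplicate: redirect the mapping and reclaim the redundant block.
    old = record["physical"]
    record["physical"] = kept
    record["flag"] = DEDUPED
    if old != kept:
        del blocks[old]
    return "duplicate"

blocks = {"B_x1": b"Data_0", "B_x": b"Data_0", "B_z": b"Data_2"}
dedup_table = {sha256_hex(b"Data_0"): "B_x1"}
rec_dup = {"logical": "Addr_0", "physical": "B_x", "flag": NOT_DEDUPED}
rec_new = {"logical": "Addr_2", "physical": "B_z", "flag": NOT_DEDUPED}
outcomes = [background_dedup(r, dedup_table, blocks) for r in (rec_dup, rec_new)]
```

The `hash_fn` parameter exists only so that a conflict can be forced in a test; a real device would use one fixed algorithm.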
The data deduplication method proposed by the present application is described in detail below in terms of online deduplication and background deduplication, respectively.
1) Online deduplication
In the embodiment of the application, in order to reduce data writing while meeting the current IO performance of the storage device, the storage device can judge whether the trigger condition of online deduplication is met after receiving a write IO request, and if so, online deduplication can be performed.
During implementation, the storage device can count current IO performance indexes, judge whether the counted IO performance indexes respectively meet preconfigured performance index thresholds, and if the IO performance indexes all meet the preconfigured performance index thresholds, indicate that the current storage device has enough resources to perform online deduplication, and reduce data writing through online deduplication.
If any IO performance index does not meet the preconfigured performance index threshold, the storage device may not start online deduplication, write the to-be-written data carried in the received write IO request to the local, and generate a logical address mapping record with a preset identification bit value as a first preset value for the to-be-written data.
The IO performance may include an IO number, an IO delay, an IO throughput, and the like, and here, the IO performance is only exemplarily described and is not specifically limited.
The preconfigured performance index threshold may include an IO number threshold, an IO delay threshold, an IO throughput threshold, and the like, and may be configured by a user, and the performance index threshold and the configuration thereof are only exemplarily described herein, and are not specifically limited.
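The trigger condition can be sketched as a check that every monitored index is within its threshold. The description does not state the direction of the comparison; this example assumes an index "meets" its threshold when it is at or below the configured limit, which corresponds to a lightly loaded device. The metric names and numbers are hypothetical.

```python
def online_dedup_allowed(metrics, thresholds):
    """Return True only if every configured IO index is within its threshold.

    Assumption: an index meets its threshold when it is less than or equal
    to the configured limit.
    """
    return all(metrics[name] <= limit for name, limit in thresholds.items())

# Hypothetical user-configured thresholds and two sampled load states.
thresholds = {"io_count": 1000, "latency_ms": 5.0, "throughput_mbps": 400.0}
idle = {"io_count": 200, "latency_ms": 1.2, "throughput_mbps": 80.0}
busy = {"io_count": 200, "latency_ms": 9.0, "throughput_mbps": 80.0}

idle_ok = online_dedup_allowed(idle, thresholds)
busy_ok = online_dedup_allowed(busy, thresholds)
```

When the check fails, the data would be written directly and its mapping record created with the first preset value, leaving it for background deduplication.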
After determining that each IO performance index meets the preconfigured performance index threshold, the storage device may calculate a hash value of the data to be written carried in the received write IO request.
The storage device may calculate the hash value of the data to be written with any suitable hash algorithm; the calculation method is not specifically limited here.
After the hash value is obtained through calculation, the storage device may match the calculated hash value with the hash values in the deduplication mapping table in sequence.
If the hash value is not matched in the deduplication mapping table, the storage device may allocate a storage space for the data to be written and may write the data to be written. The storage device may generate, for the data to be written, a logical address mapping record whose logical address is the logical address of the data to be written, whose physical address is the address of the allocated storage space, and whose preset identification bit value is the second preset value.
If the hash value is matched in the deduplication mapping table, the physical address corresponding to the hash value can be read, and the written data which has completed the deduplication processing can be read through the physical address.
The storage device may compare whether the data to be written and the written data that has completed the deduplication process are the same. If the data to be written is the same as the written data which has completed the deduplication processing, a logical address mapping record can be generated for the data to be written, wherein the logical address is the logical address of the data to be written, the physical address is the physical address of the written data which has completed the deduplication processing, and the value of the preset flag bit is the second preset value. Meanwhile, the storage device may discard the data to be written.
If the data to be written is different from the written data that has completed deduplication processing, the storage device may allocate a storage space for the data to be written and may write the data to be written. The storage device may generate, for the data to be written, a logical address mapping record whose logical address is the logical address of the data to be written, whose physical address is the address of the allocated storage space, and whose preset identification bit value is the first preset value.
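The online write path described above can be sketched as a single function with three branches. Everything here is illustrative: the dictionaries, the `allocate` helper, the block naming, and the SHA-256 default are assumptions. The description states explicitly only for background deduplication that a new deduplication mapping record is added when no hash matches; adding one in the online path as well is an assumption made so that later duplicates can be found.

```python
import hashlib
from itertools import count

NOT_DEDUPED, DEDUPED = 0, 1  # example first and second preset values
_block_ids = count()

def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()

def allocate(blocks, data):
    """Hypothetical allocator: store data in a fresh block, return its address."""
    addr = f"B_{next(_block_ids)}"
    blocks[addr] = data
    return addr

def online_write(logical_addr, data, l2p, dedup_table, blocks, hash_fn=sha256_hex):
    """Handle one write IO request with online deduplication enabled."""
    digest = hash_fn(data)
    kept = dedup_table.get(digest)
    if kept is None:
        # No matching hash: write the data and mark it as deduplicated.
        phys, flag = allocate(blocks, data), DEDUPED
        dedup_table[digest] = phys  # assumption, see the lead-in
    elif blocks[kept] == data:
        # Duplicate: discard the incoming data and point at the kept copy.
        phys, flag = kept, DEDUPED
    else:
        # Hash conflict: write the data, leave it flagged for later handling.
        phys, flag = allocate(blocks, data), NOT_DEDUPED
    l2p[logical_addr] = (phys, flag)
    return phys, flag

blocks, l2p, dedup_table = {}, {}, {}
first = online_write("Addr_0", b"Data_0", l2p, dedup_table, blocks)
second = online_write("Addr_5", b"Data_0", l2p, dedup_table, blocks)
```

After the two calls, both logical addresses map to the same physical block and only one copy of `Data_0` is stored.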
For example, as shown in fig. 3, assume that the data to be written is Data_0. The storage device may calculate the hash value of Data_0; assume that the calculated hash value is H_x. The storage device may look up a deduplication mapping record matching H_x in the deduplication mapping table shown in fig. 3(b1), and determine from that record the physical address, such as B_x1, of the written data that corresponds to H_x and has completed deduplication processing.
As shown in fig. 3(b2), the storage device may read the written data Data_0 that corresponds to B_x1 and has completed deduplication processing. After the storage device determines that the data to be written Data_0 is the same as the read written data Data_0, it may generate a logical address mapping record as shown in fig. 3(a1) for the data to be written. In this record, the logical address is the logical address Addr_0 of the data to be written, the physical address is the physical address B_x1 of the written data that has completed deduplication processing, and the identification bit takes the second preset value, 1. The storage device may add the record to the logical address mapping table. Meanwhile, the storage device may discard the data to be written.
If the data to be written is Data_1, the storage device may calculate the hash value of Data_1; assume that the calculated hash value is H_y. The storage device may look up a deduplication mapping record matching H_y in the deduplication mapping table shown in fig. 3(b1), and determine from that record the physical address, such as B_y1, of the written data that corresponds to H_y and has completed deduplication processing.
The storage device may read, through B_y1, the corresponding written data Data_m that has completed deduplication processing. After the storage device determines that the data to be written Data_1 is different from the read written data Data_m, it may allocate storage space for the data to be written and write the data. The storage device may generate a logical address mapping record as shown in fig. 3(a1) for the data to be written, in which the logical address is the logical address (e.g., Addr_1) of the data to be written, the physical address is the address (e.g., B_y) of the allocated storage space, and the preset identification bit takes the first preset value (e.g., 0). The storage device may add the record to the logical address mapping table.
If the data to be written is Data_2, the storage device may calculate the hash value of Data_2; assume that the calculated hash value is H_q. The storage device may look up a deduplication mapping record matching H_q in the deduplication mapping table shown in fig. 3(b1). After the storage device determines that no deduplication mapping record matching H_q is found, it may allocate storage space for the data to be written and write the data. The storage device may generate a logical address mapping record as shown in fig. 3(a1) for the data to be written, in which the logical address is the logical address (e.g., Addr_2) of the data to be written, the physical address is the address (e.g., B_q) of the allocated storage space, and the preset identification bit takes the second preset value (e.g., 1).
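The three online cases walked through above (hash matched and data identical, hash matched but data different, hash not matched) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the implementation of the present application: the class `ToyStore`, its in-memory dictionaries, the `hash_fn` parameter, and the use of SHA-256 are all hypothetical. Registering the new hash in the deduplication mapping table in the unmatched case is likewise an assumption, made by analogy with the background deduplication case.

```python
import hashlib
from dataclasses import dataclass

FIRST_PRESET = 0   # deduplication not completed (example value from the text)
SECOND_PRESET = 1  # deduplication completed (example value from the text)


@dataclass
class MappingRecord:
    logical_addr: str
    physical_addr: int
    flag: int  # the preset identification bit


class ToyStore:
    """Hypothetical in-memory stand-in for the storage device."""

    def __init__(self, hash_fn=None):
        self.blocks = {}         # physical address -> stored bytes
        self.dedup_table = {}    # hash value -> physical address of deduplicated data
        self.logical_table = {}  # logical address -> MappingRecord
        # Assumption: SHA-256 unless the caller supplies another hash function.
        self.hash_fn = hash_fn or (lambda data: hashlib.sha256(data).hexdigest())
        self._next_addr = 0

    def _allocate(self, data):
        addr = self._next_addr
        self._next_addr += 1
        self.blocks[addr] = data
        return addr

    def write_online(self, logical_addr, data):
        digest = self.hash_fn(data)
        hit = self.dedup_table.get(digest)
        if hit is None:
            # Hash not matched: store the data, flag = second preset value.
            phys = self._allocate(data)
            # Assumption: the new hash is registered here, as in the background case.
            self.dedup_table[digest] = phys
            flag = SECOND_PRESET
        elif self.blocks[hit] == data:
            # Hash matched and data identical: reuse the block, discard the new data.
            phys, flag = hit, SECOND_PRESET
        else:
            # Hash matched but data differs: store separately, flag = first preset value.
            phys, flag = self._allocate(data), FIRST_PRESET
        record = MappingRecord(logical_addr, phys, flag)
        self.logical_table[logical_addr] = record
        return record
```

Passing a deliberately weak `hash_fn` (for example, one that returns only the first byte) is a simple way to exercise the third branch, in which two different data blocks share a hash value.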
2) Background deduplication
To control the starting and stopping of background deduplication according to the current storage space occupancy of the storage device, and thereby reduce that occupancy in time, the storage device may decide whether to perform background deduplication based on a trigger condition for background deduplication.
The trigger condition for background deduplication may be evaluated as follows. The storage device may count the number of logical address mapping records in the logical address mapping table whose preset identification bit takes the first preset value, and judge whether the counted number reaches a preset threshold. If it does, the current storage space occupancy of the storage device is too high, and background deduplication may be started to reduce it. If it does not, the current storage space of the storage device can meet the storage requirement, and background deduplication need not be started.
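As an illustration of this trigger condition, the following Python sketch counts the records whose identification bit takes the first preset value and compares the count with a threshold. The function name, the tuple layout of a record, and the threshold values are hypothetical; "reaches" is read here as "greater than or equal to".

```python
FIRST_PRESET = 0  # example value from the text for "deduplication not completed"


def should_start_background_dedup(records, threshold):
    """records: iterable of (logical_addr, physical_addr, flag) tuples (assumed layout)."""
    pending = sum(1 for _, _, flag in records if flag == FIRST_PRESET)
    # Background deduplication starts once the pending count reaches the threshold.
    return pending >= threshold
```

A caller would typically evaluate this check periodically or after a batch of writes, which is a design choice the text leaves open.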
After determining that the counted number of logical address mapping records reaches the preset threshold, the storage device may sequentially read the logical address mapping records from the logical address mapping table. The storage device may then determine whether the preset identification bit in each read logical address mapping record takes the first preset value. If it does, the written data corresponding to that logical address mapping record needs deduplication processing.
In the deduplication process, the storage device may calculate a hash value of the written data, and may sequentially match the calculated hash value with hash values in the deduplication mapping table.
If the hash value is not matched in the deduplication mapping table, the storage device may set the identification bit in the logical address mapping record corresponding to the written data to the second preset value, and add to the deduplication mapping table a deduplication mapping record containing the hash value of the written data and the physical address of the written data.
If the hash value is matched in the deduplication mapping table, the storage device may read the physical address of the written data that corresponds to the hash value and has completed deduplication processing, and then read that written data through the physical address.
The storage device may compare the written data with the written data that has completed deduplication processing. If the two are the same, the storage device may set the identification bit in the logical address mapping record corresponding to the written data to the second preset value, and change the physical address in that record to the physical address of the written data that has completed deduplication processing.
If the two are different, the storage device may leave the logical address mapping record corresponding to the written data unchanged, so that its identification bit still takes the first preset value.
For example, as shown in fig. 3, if the written data corresponding to the read logical address mapping record is Data_0 shown in fig. 3(a2), the storage device may calculate the hash value of Data_0; assume that the calculated hash value is H_x. The storage device may look up a deduplication mapping record matching H_x in the deduplication mapping table shown in fig. 3(b1), and determine from that record the physical address, such as B_x1, of the written data that corresponds to H_x and has completed deduplication processing.
As shown in fig. 3(b2), the storage device may read the written data Data_0 that corresponds to B_x1 and has completed deduplication processing. After determining that the written Data_0 is the same as the read Data_0 that has completed deduplication processing, the storage device may change the physical address B_x in the logical address mapping record corresponding to the written Data_0 to the physical address B_x1 of the written data that has completed deduplication processing, and change the preset identification bit to the second preset value, for example 1. The storage device may then reclaim the storage space of the written Data_0 at B_x.
If the written data corresponding to the read logical address mapping record is Data_1 shown in fig. 3(a2), the storage device may calculate the hash value of Data_1; assume that the calculated hash value is H_y. The storage device may look up a deduplication mapping record matching H_y in the deduplication mapping table shown in fig. 3(b1), and determine from that record the physical address, such as B_y1, of the written data that corresponds to H_y and has completed deduplication processing.
The storage device may read, through B_y1, the corresponding written data Data_m that has completed deduplication processing. After the storage device determines that the written Data_1 is different from the read Data_m, it may leave the logical address mapping record corresponding to the written data unmodified, so that its identification bit still takes the first preset value (e.g., 0).
If the written data corresponding to the read logical address mapping record is Data_2 shown in fig. 3(a2), the storage device may calculate the hash value of Data_2; assume that the calculated hash value is H_q. The storage device may look up a deduplication mapping record matching H_q in the deduplication mapping table shown in fig. 3(b1). After the storage device determines that no deduplication mapping record matching H_q is found, it may change the preset identification bit in the logical address mapping record corresponding to the written data to the second preset value (e.g., 1).
In the embodiment of the present application, if the written data and the written data that has completed the deduplication processing are different, which may indicate that a hash value collision has occurred, the storage device may determine that the deduplication processing of the written data has not been successful. The storage device may not change the logical address mapping record whose identification bit value corresponding to the written data is the first preset value.
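The background pass described above can be sketched in Python as follows. This is a simplified illustration, not the implementation of the present application: the function name `background_pass`, the use of plain dictionaries for the two tables and the block store, and the default SHA-256 hash are assumptions. Each logical address mapping record is modeled as a two-element list `[physical_addr, flag]`.

```python
import hashlib

FIRST_PRESET, SECOND_PRESET = 0, 1  # example values from the text


def background_pass(logical_table, dedup_table, blocks, hash_fn=None):
    """Run one background deduplication pass and return the reclaimed addresses.

    logical_table: logical address -> [physical_addr, flag]  (assumed layout)
    dedup_table:   hash value -> physical address of deduplicated data
    blocks:        physical address -> stored bytes
    """
    hash_fn = hash_fn or (lambda data: hashlib.sha256(data).hexdigest())
    reclaimed = []
    for record in logical_table.values():
        phys, flag = record
        if flag != FIRST_PRESET:
            continue  # already deduplicated, nothing to do
        data = blocks[phys]
        digest = hash_fn(data)
        hit = dedup_table.get(digest)
        if hit is None:
            # Hash not matched: register it and mark the record as completed.
            dedup_table[digest] = phys
            record[1] = SECOND_PRESET
        elif blocks[hit] == data:
            # Same data: repoint the record and mark it as completed.
            record[0], record[1] = hit, SECOND_PRESET
            if hit != phys:
                reclaimed.append(phys)
        # Otherwise a hash collision occurred: leave the record unchanged.
    for addr in reclaimed:
        del blocks[addr]  # free the duplicate copies
    return reclaimed
```

Because a colliding record keeps the first preset value, it is examined again on every later pass; whether such records should be tracked separately is a design question the text does not address.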
In the deduplication method of the present application, a preset identification bit that identifies the deduplication state is added to each logical address mapping record of the logical address mapping table. After the storage device determines that the preset identification bit in a read logical address mapping record takes the first preset value, it may perform deduplication processing on the written data corresponding to that logical address mapping record based on a preset deduplication policy.
On one hand, because a preset identification bit identifying the deduplication state of the data is added, the storage device can identify the written data that has not undergone online or background deduplication processing and perform background deduplication processing on that data, which greatly reduces the resource consumption of background deduplication and improves its efficiency.
On the other hand, the storage device may count the number of logical address mapping records whose preset identification bit takes the first preset value, or collect the current IO performance indexes, to determine the current storage space occupancy or the current IO performance state of the storage device, and thereby dynamically schedule deduplication, for example starting, stopping, and switching between background deduplication and online deduplication.
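The IO-performance-based part of this scheduling can be illustrated with a short Python sketch. The text does not name the IO performance indexes or say in which direction they are compared with their thresholds, so the metric names (`latency_ms`, `queue_depth`), the "less than or equal to" comparison, and the function name below are all assumptions made for illustration.

```python
def choose_write_mode(metrics, limits):
    """Return "online" if every metric is within its limit, else "deferred".

    metrics: current IO performance indexes, e.g. {"latency_ms": 2.0}
    limits:  preconfigured thresholds keyed by the same (hypothetical) names
    """
    within_limits = all(metrics[name] <= limit for name, limit in limits.items())
    # "deferred" means: write without deduplication and leave the record's
    # identification bit at the first preset value for a later background pass.
    return "online" if within_limits else "deferred"
```

For instance, with limits of 5 ms latency and a queue depth of 32, a lightly loaded device would take the online path, while a device reporting 9 ms latency would defer deduplication to the background.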
Corresponding to the foregoing embodiments of the data deduplication method, the present application also provides embodiments of a data deduplication apparatus.
The embodiments of the data deduplication apparatus can be applied to a storage device. The apparatus embodiments may be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, the apparatus, as a logical device, is formed when the processor of the storage device where it is located reads the corresponding computer program instructions from the nonvolatile memory into the memory and runs them. In terms of hardware, fig. 4 shows a hardware structure diagram of the storage device where the data deduplication apparatus of the present application is located. In addition to the processor, the memory, the network output interface, and the nonvolatile memory shown in fig. 4, the storage device where the apparatus is located may include other hardware according to its actual function, which is not described again here.
Referring to fig. 5, fig. 5 is a block diagram illustrating a data deduplication apparatus according to an exemplary embodiment of the present application. The device is applied to a storage device, wherein the storage device stores a logical address mapping table, the logical address mapping table comprises a plurality of logical address mapping records, and the logical address mapping records comprise mapping relations of logical addresses, physical addresses and preset identification bits of written data; wherein, the value of the preset identification bit in the logical address mapping record corresponding to the written data which does not complete the deduplication processing is a first preset value, and the apparatus includes: a reading unit 510, a judging unit 520 and a deduplication processing unit 530.
A reading unit 510, configured to sequentially read logical address mapping records from the logical address mapping table;
a determining unit 520, configured to determine whether a value of the preset identification bit in the read logical address mapping record is a first preset value;
a deduplication processing unit 530, configured to perform deduplication processing on the written data corresponding to the logical address mapping record based on a preset deduplication policy if the value of the identification bit is the first preset value.
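The cooperation of the three units can be sketched in Python as three small functions. This is only one possible software decomposition under stated assumptions: the function names, the tuple layout of a record, and the idea of passing the deduplication processing unit in as a callable are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[str, str, int]  # (logical address, physical address, identification bit)
FIRST_PRESET = 0  # example value from the text


def reading_unit(table: Iterable[Record]) -> Iterator[Record]:
    """Sequentially read logical address mapping records from the table."""
    for record in table:
        yield record


def judging_unit(record: Record) -> bool:
    """Judge whether the identification bit takes the first preset value."""
    return record[2] == FIRST_PRESET


def run_units(table: Iterable[Record], dedup_unit: Callable[[Record], None]) -> int:
    """Feed every record that still needs deduplication to the processing unit."""
    handled = 0
    for record in reading_unit(table):
        if judging_unit(record):
            dedup_unit(record)
            handled += 1
    return handled
```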
In an optional implementation, the apparatus further includes a marking unit 540, configured to respond to a received write IO request as follows: if deduplication processing is performed on the data to be written carried in the write IO request based on the preset deduplication policy and is not successful for the data to be written, or if deduplication processing is not performed on the data to be written carried in the write IO request, generate, in the logical address mapping table, a logical address mapping record for the data to be written in which the preset identification bit takes the first preset value.
In another optional implementation, the storage device further includes a preconfigured deduplication mapping table; the deduplication mapping table includes a plurality of deduplication mapping records, where each deduplication mapping record includes a mapping relation between a hash value and a physical address of written data that has completed deduplication processing;
the preset deduplication strategy comprises the following steps: calculating a hash value of target data to be subjected to deduplication processing; matching the calculated hash value with the hash value in the deduplication mapping table in sequence; if the hash value is matched in the deduplication mapping table, reading written data which is corresponding to a physical address with a mapping relation with the hash value and has been subjected to deduplication processing; comparing the target data with the read written data which has finished the deduplication processing; and if the target data is different from the read written data which is subjected to the deduplication processing, setting the value of the preset identification bit in the logical address mapping record corresponding to the target data as the first preset value.
In another optional implementation manner, the preset deduplication policy further includes: and if the hash value is not matched in the deduplication mapping table, setting the value of the preset identification bit in the logical address mapping record corresponding to the target data as the second preset value.
In another optional implementation manner, the preset deduplication policy further includes: and if the target data is the same as the written data which has finished the deduplication processing, setting the value of the preset identification bit in the logical address mapping record corresponding to the target data to be the second preset value, and setting the physical address to be the physical address of the written data which has finished the deduplication processing.
In another optional implementation manner, the apparatus further includes a first triggering unit 550, configured to count the number of logical address mapping records in the logical address mapping table, where the preset identification bit takes the first preset value; and judging whether the counted number of the logical address mapping records reaches a preset threshold value, and if so, executing the step of sequentially reading the logical address mapping records from the logical address mapping table.
In another optional implementation, the apparatus further includes a second trigger unit 560, configured to collect the current IO performance indexes; judge whether the collected IO performance indexes each meet the preconfigured performance index threshold; if so, execute the step of performing deduplication processing on the data to be written carried in the write IO request based on the preset deduplication policy; and if not, not perform deduplication processing on the data to be written carried in the write IO request.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (12)

1. A data deduplication method is applied to a storage device, wherein the storage device comprises a preconfigured logical address mapping table, and the logical address mapping table comprises a plurality of logical address mapping records; the logical address mapping record comprises a mapping relation of a logical address, a physical address and a preset identification bit of written data; wherein, in the logical address mapping record corresponding to the written data which does not complete the deduplication processing, the value of the preset identification bit is a first preset value, and the method includes:
sequentially reading logical address mapping records from the logical address mapping table;
judging whether the value of the preset identification bit in the read logical address mapping record is a first preset value or not;
if the value of the identification bit is the first preset value, performing deduplication processing on the written data corresponding to the logical address mapping record based on a preset deduplication strategy;
in response to a received write IO request, determining whether to perform deduplication processing on the data to be written carried in the write IO request;
if deduplication processing is performed on the data to be written carried in the write IO request based on the preset deduplication strategy and the deduplication processing is not successful for the data to be written; or,
if the data to be written carried in the write IO request is not subjected to deduplication processing, generating, in the logical address mapping table, the logical address mapping record in which the value of the preset identification bit is the first preset value for the data to be written.
2. The method of claim 1, wherein the storage device further comprises a preconfigured deduplication mapping table; the deduplication mapping table comprises a plurality of deduplication mapping records; wherein each deduplication mapping record comprises a mapping relation between a hash value and a physical address of the written data which has completed deduplication processing;
the preset deduplication strategy comprises the following steps:
calculating a hash value of target data to be subjected to deduplication processing;
matching the calculated hash value with the hash value in the deduplication mapping table in sequence;
if the hash value is matched in the deduplication mapping table, reading written data which is corresponding to a physical address with a mapping relation with the hash value and has been subjected to deduplication processing;
comparing the target data with the read written data which has finished the deduplication processing;
and if the target data is different from the read written data which is subjected to the deduplication processing, setting the value of the preset identification bit in the logical address mapping record corresponding to the target data as the first preset value.
3. The method of claim 2, wherein the preset deduplication strategy further comprises:
and if the hash value is not matched in the deduplication mapping table, setting the value of the preset identification bit in the logical address mapping record corresponding to the target data as a second preset value.
4. The method of claim 2, wherein the preset deduplication strategy further comprises:
and if the target data is the same as the written data which has finished the deduplication processing, setting the value of the preset identification bit in the logical address mapping record corresponding to the target data to be a second preset value, and setting the physical address to be the physical address of the written data which has finished the deduplication processing.
5. The method according to claim 1, further comprising, before said sequentially reading logical address mapping records from said logical address mapping table:
counting the number of logical address mapping records in the logical address mapping table whose preset identification bit takes the first preset value;
and judging whether the counted number of the logical address mapping records reaches a preset threshold value, and if so, executing the step of sequentially reading the logical address mapping records from the logical address mapping table.
6. The method according to claim 1, before performing deduplication processing on data to be written carried in the write IO request based on a preset deduplication policy, further comprising:
counting the current IO performance indexes;
judging whether the counted IO performance indexes each meet the preconfigured performance index threshold; if so, executing the step of performing deduplication processing on the data to be written carried in the write IO request based on the preset deduplication strategy; and if not, not performing deduplication processing on the data to be written carried in the write IO request.
7. A data deduplication device is applied to a storage device, wherein the storage device comprises a preconfigured logical address mapping table, and the logical address mapping table comprises a plurality of logical address mapping records; the logical address mapping record comprises a mapping relation of a logical address, a physical address and a preset identification bit of written data; wherein, in the logical address mapping record corresponding to the written data that does not complete the deduplication processing, the value of the preset identification bit is a first preset value, and the apparatus includes:
a reading unit, configured to read logical address mapping records from the logical address mapping table in sequence;
the judging unit is used for judging whether the value of the preset identification bit in the read logical address mapping record is a first preset value or not;
a deduplication processing unit, configured to perform deduplication processing on the written data corresponding to the logical address mapping record based on a preset deduplication policy if the value of the identification bit is the first preset value;
a marking unit, configured to, in response to a received write IO request, determine whether to perform deduplication processing on the data to be written carried in the write IO request; and if deduplication processing is performed on the data to be written carried in the write IO request based on the preset deduplication strategy and is not successful for the data to be written, or if the data to be written carried in the write IO request is not subjected to deduplication processing, generate, in the logical address mapping table, the logical address mapping record in which the value of the preset identification bit is the first preset value for the data to be written.
8. The apparatus of claim 7, wherein the storage device further comprises a preconfigured deduplication mapping table; the deduplication mapping table comprises a plurality of deduplication mapping records; wherein each deduplication mapping record comprises a mapping relation between a hash value and a physical address of the written data which has completed deduplication processing;
the preset deduplication strategy comprises the following steps:
calculating a hash value of target data to be subjected to deduplication processing;
matching the calculated hash value with the hash value in the deduplication mapping table in sequence;
if the hash value is matched in the deduplication mapping table, reading written data which is corresponding to a physical address with a mapping relation with the hash value and has been subjected to deduplication processing;
comparing the target data with the read written data which has finished the deduplication processing;
and if the target data is different from the read written data which is subjected to the deduplication processing, setting the value of the preset identification bit in the logical address mapping record corresponding to the target data as the first preset value.
9. The apparatus of claim 8, wherein the preset deduplication strategy further comprises:
and if the hash value is not matched in the deduplication mapping table, setting the value of the preset identification bit in the logical address mapping record corresponding to the target data as a second preset value.
10. The apparatus of claim 8, wherein the preset deduplication strategy further comprises:
and if the target data is the same as the written data which has finished the deduplication processing, setting the value of the preset identification bit in the logical address mapping record corresponding to the target data to be a second preset value, and setting the physical address to be the physical address of the written data which has finished the deduplication processing.
11. The apparatus of claim 7, further comprising: the first trigger unit is used for counting the number of the logical address mapping records of which the preset identification bit values are the first preset values in the logical address mapping table; and judging whether the counted number of the logical address mapping records reaches a preset threshold value, and if so, executing the step of sequentially reading the logical address mapping records from the logical address mapping table.
12. The apparatus of claim 7, further comprising: a second trigger unit, configured to count the current IO performance indexes; judge whether the counted IO performance indexes each meet the preconfigured performance index threshold; if so, execute the step of performing deduplication processing on the data to be written carried in the write IO request based on the preset deduplication strategy; and if not, not perform deduplication processing on the data to be written carried in the write IO request.
CN201710239910.9A 2017-04-13 2017-04-13 Data deduplication method and device Active CN107122130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710239910.9A CN107122130B (en) 2017-04-13 2017-04-13 Data deduplication method and device

Publications (2)

Publication Number Publication Date
CN107122130A CN107122130A (en) 2017-09-01
CN107122130B true CN107122130B (en) 2020-04-21

Family

ID=59724660

Country Status (1)

Country Link
CN (1) CN107122130B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107632944B (en) * 2017-09-22 2021-06-18 郑州云海信息技术有限公司 Method and device for reading data
CN110399096B (en) * 2019-06-25 2022-12-23 苏州浪潮智能科技有限公司 Method, device and equipment for deleting metadata cache of distributed file system again
CN110795031A (en) * 2019-10-17 2020-02-14 北京浪潮数据技术有限公司 Data deduplication method, device and system based on full flash storage
CN111443874B (en) * 2020-03-28 2021-07-27 华中科技大学 Solid-state disk memory cache management method and device based on content awareness and solid-state disk
CN113297105B (en) * 2021-05-08 2024-01-09 阿里巴巴新加坡控股有限公司 Cache processing method and device for converting address
CN113535708A (en) * 2021-09-17 2021-10-22 苏州浪潮智能科技有限公司 Data deduplication method, system, storage medium and equipment
CN115437579B (en) * 2022-11-04 2023-03-24 苏州浪潮智能科技有限公司 Metadata management method and device, computer equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101630290A (en) * 2009-08-17 2010-01-20 成都市华为赛门铁克科技有限公司 Method and device of processing repeated data
CN101882141A (en) * 2009-05-08 2010-11-10 北京众志和达信息技术有限公司 Method and system for implementing repeated data deletion
CN104933010A (en) * 2014-03-18 2015-09-23 华为技术有限公司 Duplicated data deleting method and apparatus
CN106095332A (en) * 2016-06-01 2016-11-09 杭州宏杉科技有限公司 A kind of data heavily delete method and device
CN106527973A (en) * 2016-10-10 2017-03-22 杭州宏杉科技股份有限公司 A method and device for data deduplication

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120278371A1 (en) * 2011-04-28 2012-11-01 Luis Montalvo Method for uploading a file in an on-line storage system and corresponding on-line storage system
CN102799659B (en) * 2012-07-05 2015-01-21 广州鼎鼎信息科技有限公司 Overall repeating data deleting system and method based on non-centre distribution system
US9262430B2 (en) * 2012-11-22 2016-02-16 Kaminario Technologies Ltd. Deduplication in a storage system
CN105843551B (en) * 2015-01-29 2020-09-15 爱思开海力士有限公司 Data integrity and loss resistance in high performance and large capacity storage deduplication
US9965487B2 (en) * 2015-06-18 2018-05-08 International Business Machines Corporation Conversion of forms of user data segment IDs in a deduplication system
CN106528703A (en) * 2016-10-26 2017-03-22 杭州宏杉科技股份有限公司 Deduplication mode switching method and apparatus

Also Published As

Publication number Publication date
CN107122130A (en) 2017-09-01

Similar Documents

Publication Publication Date Title
CN107122130B (en) Data deduplication method and device
CN108459826B (en) Method and device for processing IO (input/output) request
US8760956B1 (en) Data processing method and apparatus
US9177028B2 (en) Deduplicating storage with enhanced frequent-block detection
US10303374B2 (en) Data check method and storage system
US11232073B2 (en) Method and apparatus for file compaction in key-value store system
CN105607867B (en) Master-slave deduplication storage system, method thereof, and computer-readable storage medium
WO2017185579A1 (en) Method and apparatus for data storage
JP2013509658A (en) Allocation of storage memory based on future usage estimates
CN105095116A (en) Cache replacing method, cache controller and processor
CN107209714A (en) The control method of distributed memory system and distributed memory system
CN110377233B (en) SSD (solid state disk) read performance optimization method and device, computer equipment and storage medium
CN107193503B (en) Data deduplication method and storage device
CN111324303A (en) SSD garbage recycling method and device, computer equipment and storage medium
CN109407985B (en) Data management method and related device
CN112631953A (en) TRIM method and device for solid state disk data, electronic equipment and storage medium
CN110928496B (en) Data processing method and device on multi-control storage system
CN105389128B (en) Solid state disk data storage method and storage control
CN104408126B (en) Persistent writing method, device and system for a database
CN116340198B (en) Data writing method and device of solid state disk and solid state disk
CN108334457B (en) IO processing method and device
CN104298614A (en) Method for storing data block in memory device and memory device
CN114974365A (en) SSD (solid state disk) limited window data deduplication identification method and device and computer equipment
CN110658999B (en) Information updating method, device, equipment and computer readable storage medium
CN109871355B (en) Snapshot metadata storage method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant