WO2015100639A1

WO2015100639A1 - De-duplication method, apparatus and system

Info

Publication number: WO2015100639A1
Application number: PCT/CN2013/091170
Authority: WO
Inventors: 薛迎春; 邵长庚
Original assignee: 华为技术有限公司
Priority date: 2013-12-31
Filing date: 2013-12-31
Publication date: 2015-07-09
Also published as: CN104205097A; CN104205097B

Abstract

Provided is a data processing technology, which is used for data de-duplication and fingerprint detection on a plurality of data blocks. When the fingerprints of the plurality of data blocks are the same, the stability of a stripe group where the data blocks are located is further detected, and the data blocks with high stability are retained, or when the reliability of the stripe group where the data blocks are located is not high, data recovery is carried out on the stripe group where the data blocks are located. The reliability of the stripe group where the data blocks are located may be increased, and data security is improved.

Description

De-duplication method, apparatus and system

The present invention relates to the field of storage, and in particular to a deduplication technology.

Background art

In the storage field, deduplication is a technology that is often used to save storage space. If multiple copies of the same data need to be stored, only one of them is actually stored, and data that duplicates it is not stored again. In other words, repeated data is deleted, which is why the technique is called deduplication.

In the choice of granularity, the file can be split into data blocks, with the data block as the basic unit of deduplication. When a data block is used as a basic unit for deduplication, each data block can be fingerprinted, and the fingerprint and the content of the data block are strongly correlated. When the fingerprints of the two data blocks are the same, we can conclude that the contents of the two data blocks are the same. By performing the deduplication operation, only one of the data blocks is stored in the storage system, and the other data block is not stored.

However, deduplication technology also reduces data security. Because only one copy of the data is kept, if that copy is damaged by a storage system failure, the security of the data may be greatly reduced or the data may be permanently lost.

Summary of the invention

The invention can improve data security.

In a first aspect, the present invention provides a data processing method for a controller. The method includes: when multiple data blocks have the same fingerprint, querying the addresses of the data blocks according to the fingerprint, and finding the stripe groups occupied by the data blocks according to the data block addresses; and checking the redundant array of independent disks (RAID) stripe state of the stripe group of each data block, and retaining the data block in a stripe-consistent stripe group according to the check result. Retaining the data block in a stripe-consistent stripe group according to the check result includes at least one of the following: if there is a stripe-consistent stripe group and there is also a stripe group that is not stripe-consistent, keeping the data block in the stripe-consistent stripe group and deleting the data blocks in the remaining stripe groups; if there is no stripe-consistent stripe group but there is a degraded stripe group, repairing the degraded stripe group into a stripe-consistent stripe group by means of the RAID algorithm, storing the data block in that stripe-consistent stripe group, and deleting the data blocks in the remaining stripe groups.

In a second aspect, the present invention provides a data block processing apparatus. The apparatus comprises: a fingerprint matching module, configured to compare the fingerprints of data blocks in a storage device; an address finding module, configured to, when multiple data blocks have the same fingerprint, query the addresses of the multiple data blocks according to the fingerprint and find the stripe groups occupied by the multiple data blocks according to their addresses; and a consistency check module, configured to check the redundant array of independent disks (RAID) stripe state of the stripe groups of the multiple data blocks and to retain the data block in a stripe-consistent stripe group according to the check result. Retaining the data block in a stripe-consistent stripe group according to the check result includes at least one of the following: if there is a stripe-consistent stripe group and there is also a stripe group that is not stripe-consistent, keeping the data block in the stripe-consistent stripe group and deleting the data blocks in the remaining stripe groups; if there is no stripe-consistent stripe group but there is a degraded stripe group, repairing the degraded stripe group into a stripe-consistent stripe group by means of the RAID algorithm, storing the data block in that stripe-consistent stripe group, and deleting the data blocks in the remaining stripe groups.

In a third aspect, the present invention provides a data block processing method for a controller. The method includes: querying a fingerprint database of the data blocks in a storage device and, when the storage device holds a stored data block with the same fingerprint as a data block to be stored, detecting the redundant array of independent disks (RAID) stripe state of the stripe group in which the stored data block is located; and storing the data block according to the detection result, including: if the stripe state of the stripe group is stripe-consistent, not storing the data block to be stored; if the stripe state of the stripe group is degraded, either storing the data block to be stored and deleting the stored data block, or repairing the stored data block from the degraded stripe group according to a RAID algorithm; or, if the stripe state of the stripe group is inconsistent, either storing the data block to be stored and deleting the stored data block, or, if no data error has occurred in the data stripe units, repairing the stored data block according to a RAID algorithm.

In a fourth aspect, the present invention provides a data block processing apparatus. The apparatus comprises: a query module 61, configured to query a fingerprint database of the data blocks in a storage device; and a consistency check module 62, configured to, when the storage device holds a stored data block with the same fingerprint as a data block to be stored, detect the redundant array of independent disks (RAID) stripe state of the stripe group in which the stored data block is located, and to store the data block according to the detection result, including: if the stripe state of the stripe group is stripe-consistent, not storing the data block to be stored; if the stripe state of the stripe group is degraded, either storing the data block to be stored and deleting the stored data block, or repairing the stored data block from the degraded stripe group according to a RAID algorithm; or, if the stripe state of the stripe group is inconsistent, either storing the data block to be stored and deleting the stored data block, or repairing the stored data block according to a RAID algorithm.

When a data block is deduplicated, the solution of the invention can check the stability of the stripe group in which the data block is located and, when the stability is insufficient, improve it by re-storing or repairing the data block, thereby improving data security.

Brief description of the drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them.

FIG. 1 is an example of a topology diagram of an application scenario of the present invention;

FIG. 2 is a flowchart of a data block processing method according to an embodiment of the present invention;

FIG. 3 is a flowchart of a data block processing method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an embodiment of a data block processing apparatus;

FIG. 5 is a schematic diagram of an embodiment of a data block processing apparatus.

Detailed description

The technical solutions of the present invention will be clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained based on the embodiments of the present invention are within the scope of the present invention.

A storage system usually consists of a controller and a storage device. The controller is equivalent to a computer: it includes a processor and memory, manages the storage device, and provides interfaces to the host and to the storage device. The storage device provides the physical storage space and may be composed of, for example, Solid State Disks (SSD) and Serial Attached SCSI (SAS) disks. When writing data, the host first sends the write request to the controller; the controller then allocates storage space in the storage device for the write request and sends the data carried in the write request to the storage device for storage. When reading data, the data is read from the storage device into the controller and then sent to the host by the controller. The controller and the storage device can be physically separate devices, or the storage device can be integrated into the controller. When the storage device is integrated in the controller, the data interaction between the controller and the storage device becomes a data interaction inside the controller, and the controller can also be referred to as a storage server.

In another topology, the controller provides management but does not take part in data transfer, so data can be exchanged between the storage device and the host without passing through the controller. The technical solution provided by the embodiment of the present invention can also be applied in that topology.

Deduplication technology can be divided into online (On-line) and offline (Off-line) modes. The online mode makes better use of the storage device's space; the offline mode writes data faster.

In online mode, the controller receives a write request that carries a new data block. Before the new data block is written to the storage device, the controller checks whether the same data block already exists in the storage device. If it does not exist, the new data block is stored in the storage device. If it exists, the new data block is not stored again; instead, an index relationship is established between the LU that owns the new data block and the existing data block. This index relationship can be, for example, a pointer, so that when the new data block needs to be read later, the existing data block can be read through the pointer.
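The online write path described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the controller's actual implementation: the class name `OnlineDedupStore`, the use of MD5 as fingerprint, and the `(lu, offset)` key used as the "pointer" are all assumptions made for the example.

```python
import hashlib


class OnlineDedupStore:
    """Toy model of online deduplication (all names are hypothetical)."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> stored block content
        self.lu_index = {}  # (lu, offset) -> fingerprint, i.e. the pointer

    def write(self, lu, offset, data: bytes) -> bool:
        """Return True if the block was physically stored, False if deduplicated."""
        fp = hashlib.md5(data).hexdigest()
        stored = fp not in self.blocks
        if stored:
            self.blocks[fp] = data
        # In both cases the LU ends up pointing at the single stored copy.
        self.lu_index[(lu, offset)] = fp
        return stored

    def read(self, lu, offset) -> bytes:
        """Follow the pointer from the LU to the shared block."""
        return self.blocks[self.lu_index[(lu, offset)]]
```

Writing the same content through two different LUs stores it once; the second write only adds a pointer.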

In offline mode, after the controller receives a write request, the data block is first stored in the storage device regardless of whether the storage device already holds the same data block. The deduplication operation is then performed periodically, or when the storage device is idle. During the deduplication operation, only one copy of each duplicated data block is retained, and the storage space occupied by the other copies is freed. The LUs that pointed to the deleted copies are repointed to the retained copy.
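A simplified sketch of such an offline pass is shown below. It assumes an in-memory `storage` dictionary keyed by block address and an `lu_pointers` dictionary from LU name to block address; both are hypothetical stand-ins for the controller's metadata. Note that this sketch keeps the copy with the lowest address without looking at stripe reliability, which is the conventional behaviour the invention sets out to improve.

```python
import hashlib


def offline_dedup(storage: dict, lu_pointers: dict) -> int:
    """Keep one copy per fingerprint, repoint LUs, return the number of freed blocks."""
    kept = {}      # fingerprint -> address of the retained copy
    redirect = {}  # address of a deleted copy -> address of the retained copy
    for address in sorted(storage):
        fp = hashlib.md5(storage[address]).hexdigest()
        if fp in kept:
            redirect[address] = kept[fp]
        else:
            kept[fp] = address
    # Free the space occupied by the duplicate copies.
    for address in redirect:
        del storage[address]
    # Repoint every LU that referenced a deleted copy.
    for lu, address in lu_pointers.items():
        lu_pointers[lu] = redirect.get(address, address)
    return len(redirect)
```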

Since only one copy of duplicate data is kept, the security of that single copy is particularly important. A storage system can use a Redundant Array of Independent Disks (RAID) to improve data security. However, the protection that RAID can provide is limited: when the reliability of a RAID stripe is reduced, the security of the data stored in it is reduced as well.

In RAID technology, a stripe consists of several stripe units (SUs). The SUs that make up one stripe can belong to different physical memories. SUs belonging to the same stripe can have the same size of storage space. For ease of management, SUs belonging to the same stripe can also have the same offset, meaning that they are at the same location in different memories. For RAID5 or RAID6, for example, SUs can be divided into data SUs and check SUs. A data SU stores service data; a check SU stores the check data of the service data and can also be called a redundant SU. An integer number of stripes can form a logical unit (LU), and an LU can be used as a host-oriented logical storage unit. By convention, a logical unit is also referred to by its Logical Unit Number (LUN), and this document follows that convention.

FIG. 1 shows an example of a topology diagram of an application scenario of the present invention. Controller 1 and storage device 2 are connected to form a storage system. The controller 1 is composed of a processor 11 and a cache 12. Computer instructions are stored in the cache 12, and the processor 11 executes them to perform the corresponding operations on the storage device and thereby carry out the present invention. The storage device 2 is composed of a plurality of memories 21, each of which provides one stripe unit SU; together these form a stripe 211. When data is stored, the host sends the data to the controller 1, and the controller stores the data in a stripe of the storage device. A data block occupies an integer number of stripes, and controller 1 can deduplicate the data at the granularity of the data block.

An error in, or loss of, the data in an SU is called an SU fault. When one of the SUs in a stripe fails, the RAID algorithm can use the data stored in the non-faulty SUs to recover the data of the faulty SU. This recovery process is called stripe repair. Some RAID algorithms can recover data only for a single SU fault, while others can recover data for a larger number of SU faults. The number of faulty SUs that the RAID algorithm can recover is called the number of SU faults that the stripe tolerates. For example, RAID5 tolerates one SU fault, while RAID6 tolerates two.
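For a single-parity layout such as RAID5, the recovery of one faulty SU reduces to a byte-wise XOR over the surviving SUs. The sketch below illustrates that idea only; the helper names `xor_units` and `recover_failed_su` are hypothetical, and real arrays add parity rotation and, for RAID6, a second check code that is not shown here.

```python
from functools import reduce


def xor_units(units):
    """XOR equally sized byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))


def recover_failed_su(stripe, failed_index):
    """Rebuild the SU at failed_index from all other SUs of the stripe.

    `stripe` is the list of data SUs followed by one XOR check SU.
    """
    survivors = [su for i, su in enumerate(stripe) if i != failed_index]
    return xor_units(survivors)
```

Because XOR is its own inverse, the same function rebuilds either a data SU or the check SU.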

The state of a stripe is one of: stripe-consistent, degraded, inconsistent, or failed. The reliability of stripes in these four states decreases in that order. Stripe-consistent is the normal state: the data in all SUs of the stripe is normal, that is, the data of each SU can be read, and the check data calculated from the data SUs is the same as the data stored in the check SU. Inconsistent means that the data in each SU of the stripe can be read, but the check data calculated from the data SUs differs from the data stored in the check SU. The cause may be an error in the data of a data SU, an error in the data of a check SU, or errors in both. Since the check SU stores redundant data, when only the check SU is erroneous and the data SUs are not, the data is not considered lost, and the data of the check SU can be recalculated from the data SUs.

Degraded means that the stripe contains a faulty SU, but the data in the faulty SU can be recovered from the remaining SUs of the stripe. When a stripe tolerates more than one SU fault, the degraded state can be further subdivided into several levels: the more SUs are faulty, the lower the reliability. For example, with RAID6 both one SU fault and two SU faults count as degraded, but the data in a stripe with two SU faults is less secure than in a stripe with one. Failed means that the stripe contains a faulty SU whose data cannot be recovered from the remaining SUs, that is, some data in the stripe is permanently lost. An SU fault means that there is a logical or physical error in the storage space of the SU, so that the data in the SU cannot be read, or cannot be read completely.
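The four states can be told apart with a small check like the one below. It is a sketch under simplifying assumptions: a single XOR check SU that tolerates one fault (RAID5-like), and unreadable SUs represented as `None`. The names `StripeState`, `TOLERATED_FAULTS` and `classify` are hypothetical; the numeric values merely encode the reliability order given in the text.

```python
from enum import IntEnum
from functools import reduce


class StripeState(IntEnum):
    """Higher value means higher reliability, following the order in the text."""
    FAILED = 0
    INCONSISTENT = 1
    DEGRADED = 2
    CONSISTENT = 3


TOLERATED_FAULTS = 1  # assumption: single-parity stripe


def classify(data_sus, check_su):
    """Classify a stripe; an SU that cannot be read is passed as None."""
    units = list(data_sus) + [check_su]
    faults = sum(su is None for su in units)
    if faults > TOLERATED_FAULTS:
        return StripeState.FAILED
    if faults > 0:
        return StripeState.DEGRADED
    # Every SU is readable: compare recomputed check data with the stored one.
    computed = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_sus))
    if computed == check_su:
        return StripeState.CONSISTENT
    return StripeState.INCONSISTENT
```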

A data block is the basic unit of deduplication. A data block is stored in one or more stripes, and the stripes that store the same data block are called a stripe group. A stripe group therefore consists of one stripe or multiple stripes. The data of a LUN consists of the data blocks that the LUN points to.

When performing deduplication, the embodiment of the present invention takes into account the reliability level of the stripe in which a data block is located. In offline deduplication, when there are multiple identical data blocks, the reliability level of the stripe holding each data block is checked; the data block in the most reliable stripe is retained, and the data blocks in the remaining stripes are deleted. When none of the stripes is stripe-consistent, that is, every stripe has suffered some loss of reliability, the data block can be written into a stripe-consistent stripe and the data in the remaining stripes deleted. In online deduplication, if the storage device already holds a data block identical to the data block to be stored, the reliability level of the stored data block is checked. If its stripe state is not stripe-consistent, the data block to be stored is written into a stripe-consistent stripe and the stored data block is deleted.

When a data block occupies multiple stripes, the reliability level of the least reliable stripe is used as the reliability level of the entire stripe group. With the technology provided by the embodiment of the present invention, the deduplication process additionally considers the reliability level of the stripe in which the data block is located and stores the data block in a highly reliable stripe, which improves the security of the data block.
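The rule that a stripe group is only as reliable as its weakest stripe amounts to taking a minimum over the stripe states. A minimal sketch, assuming the states are given as plain strings and ranked by a hypothetical `RELIABILITY` table:

```python
# Assumed ranking: larger number = more reliable, per the order in the text.
RELIABILITY = {"consistent": 3, "degraded": 2, "inconsistent": 1, "failed": 0}


def group_state(stripe_states):
    """Return the state of the least reliable stripe in the group."""
    return min(stripe_states, key=RELIABILITY.__getitem__)
```

For example, a group made of two consistent stripes and one degraded stripe is treated as degraded.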

As shown in Figure 2, it is a flowchart of a data block processing method, which can perform offline deduplication. The method includes the following steps.

Step 31: The controller compares the fingerprints of the data blocks in the storage system.

Each data block has a fingerprint. For example, the data content of the data block can be calculated by using an algorithm such as MD5 or SHA 128, and the calculated result is used as a fingerprint of the data block. Data blocks with the same fingerprint are the same data block. The fingerprint of the data block in the storage device is stored in the fingerprint library of the controller.
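As an illustration of step 31, the following sketch fingerprints blocks with MD5 from Python's standard `hashlib` module and groups block addresses by fingerprint. The dictionary of blocks keyed by address and the function names are assumptions made for the example; the fingerprint library of a real controller would be persistent.

```python
import hashlib
from collections import defaultdict


def fingerprint(block: bytes) -> str:
    """Return the MD5 digest of a data block as a hex string."""
    return hashlib.md5(block).hexdigest()


def duplicate_groups(blocks: dict) -> dict:
    """Map each fingerprint shared by several blocks to the addresses of those blocks."""
    groups = defaultdict(list)
    for address, content in blocks.items():
        groups[fingerprint(content)].append(address)
    return {fp: addrs for fp, addrs in groups.items() if len(addrs) > 1}
```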

Step 32: When there are multiple data blocks having the same fingerprint, query the addresses of all the data blocks that own the fingerprint according to the fingerprint, and search for the strip group occupied by the data block according to the data block address.

When multiple data blocks with the same fingerprint are found, one of the data blocks can be retained and the remaining data blocks deleted to reduce the storage space used in the storage device. The controller stores a mapping table that records each fingerprint together with the storage addresses of the data blocks the fingerprint represents. The controller can find the stripe group that stores a data block from the address of the data block. Data blocks are stored in stripe groups, and data blocks and stripe groups are in one-to-one correspondence. The data block address can be expressed as an offset within the LU, which can be converted into a physical address; the location information of the stripe group in which the data block is located can be a physical address.

Step 33: Check the redundant array of independent disks (RAID) stripe state of the stripe group of each data block, and retain the data block in a stripe-consistent stripe group according to the check result.

Retaining the data block in a stripe-consistent stripe group according to the check result includes at least one of the following: if a stripe-consistent stripe group already exists and there is also a stripe group that is not stripe-consistent, the data block in the stripe-consistent stripe group is kept unchanged and the data blocks in the remaining stripe groups are deleted; if there is no stripe-consistent stripe group but there is a degraded stripe group, the RAID algorithm is used to repair the degraded stripe group into a stripe-consistent stripe group, and the data blocks in the remaining stripe groups are deleted. The stripe state is checked for each stripe group, and the state of a stripe group is determined by its least reliable stripe.

Repair in the embodiments of the present invention means that, when some SUs in a stripe have data errors or are faulty, the RAID check algorithm is applied to the data in the normal SUs to recalculate the data of the faulty or erroneous SUs, and the data of the normal SUs together with the recalculated data is written anew into a stripe group of the storage device. The stripe state of the repaired stripe group is stripe-consistent, so the data block in it is more secure than in a degraded or inconsistent stripe group. As to where the repaired data block is written, a new stripe group can be allocated; if the original stripe group can still be read and written normally, the repaired data can also overwrite the data in the original stripe group.

One policy provided by the embodiment of the present invention is as follows: once the stripe state of any stripe group is found to be stripe-consistent, the data block in that stripe group is reliable, and the other stripe groups can be deleted directly without having their stripe state checked.

In existing deduplication technology, no stripe check is performed, so if there are both stripe-consistent stripe groups and stripe groups that are not stripe-consistent, the data block in an arbitrary one of them is retained. If the data block in a failed stripe group is retained and the other stripe groups are cleared, the content of the data block can no longer be read normally; in other words, the deduplication operation causes data loss. Stripe groups that are not stripe-consistent include those in any state other than stripe-consistent, namely inconsistent, degraded, or failed. Statistically, over many deduplication operations, the prior art may lead to data loss or reduced data security. Compared with the prior art, the embodiment of the present invention can improve data reliability.

The embodiment of the present invention further provides another strategy: if there is no stripe-consistent stripe group but there is a degraded stripe group, the degraded stripe group is repaired into a stripe-consistent stripe group by the RAID algorithm, and the data blocks in the remaining stripe groups are deleted.

In existing deduplication technology, the stripe groups are not subjected to a stripe check, so their reliability is unknown, and the deduplicated data block may end up stored in a degraded stripe group or in another stripe group of low stability. In this case the embodiment of the present invention can repair the degraded stripe group, so that the data block finally retained is stored in a stripe-consistent stripe group. This increases data security.

It should be noted that the two strategies are independent and can be performed arbitrarily, which can improve the data security of the storage device statistically. Therefore, the device or controller that implements this method can support both strategies, or only one of them.
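The two strategies can be combined into one selection routine, sketched below. The input is a hypothetical mapping from stripe group name to state string; the function returns which group to keep and whether it must first be repaired. The fall-through case (no consistent and no degraded group) is left to other handling and is simply reported as `None` here.

```python
def choose_retained(groups: dict):
    """Return (name of the stripe group to keep, needs_repair).

    Strategy 1: keep a stripe-consistent group if one exists.
    Strategy 2: otherwise repair a degraded group and keep it.
    """
    consistent = [name for name, state in groups.items() if state == "consistent"]
    if consistent:
        return consistent[0], False
    degraded = [name for name, state in groups.items() if state == "degraded"]
    if degraded:
        return degraded[0], True
    return None, False
```

A controller that supports only one of the two strategies would implement only the corresponding branch.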

The embodiment of the present invention further provides an optional policy: if all the stripe groups are failed stripe groups, the data in the stripe-consistent stripes within those stripe groups is deduplicated, and the data in the failed stripes is deleted. Like the first two strategies, this strategy is independent. A device or controller that implements this method can support any one of these strategies, or any two or all three of them.

When all the stripe groups are failed stripe groups, every stripe group has suffered permanent data loss, and no single stripe group is sufficient to recover the entire data block. However, a stripe group consists of stripes, and embodiments of the present invention can save the data in some of those stripes. The salvaged stripes may be able to form a complete stripe group. Even if they are not enough to form a complete stripe group, it is still worthwhile to retain the data in these stripes: for example, when new data is written in the future, the newly written data and the retained stripes may be combined into a complete stripe group. This strategy therefore avoids or reduces data loss in the data block.

The measures for retaining stripes include the following. If there are stripe-consistent stripes, the stripe-consistent stripes are deduplicated. If there is a degraded stripe and no stripe-consistent stripe storing the same data, the RAID algorithm is used to repair the degraded stripe. If there is an inconsistent stripe whose data SUs contain no error, and there is neither a stripe-consistent stripe nor a degraded stripe storing the same data, the RAID algorithm is used to repair that stripe.
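One way to read these measures is as a per-stripe preference order applied across the failed stripe groups. The sketch below assumes that every stripe group lists its stripe states in the same order, so that position i in each group holds the same part of the data block, and that an inconsistent stripe whose data SUs are known to be intact is marked `"inconsistent_data_ok"`. Both the representation and the function name `salvage_plan` are hypothetical.

```python
PREFERENCE = ["consistent", "degraded", "inconsistent_data_ok"]


def salvage_plan(groups):
    """For each stripe position return (group index, needs_repair) or None.

    `groups` is a list of equally long lists of stripe-state strings.
    """
    plan = []
    for position in zip(*groups):
        choice = None
        for wanted in PREFERENCE:
            if wanted in position:
                # Anything other than a consistent stripe must be repaired first.
                choice = (position.index(wanted), wanted != "consistent")
                break
        plan.append(choice)
    return plan
```

A `None` entry marks a part of the data block that cannot be salvaged from any group.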

Step 34: Repoint the LUs that pointed to the stripe groups in which the data blocks were located to the stripe-consistent stripe group.

The LU is managed by the controller and provided to the host for use. The controller records the stripe groups that each LU points to; the data blocks in those stripe groups form the data of the LU. When the host reads data, the data block stored for the LU can be found through the pointing relationship between the LU and the stripe group. During deduplication, some data blocks are deleted and the retained data block is shared by the LUs concerned. It is therefore necessary to repoint the LUs that pointed to the stripe groups of the deleted data blocks to the stripe-consistent stripe group.

Depending on the case in step 33: if a stripe-consistent stripe group already existed when step 33 was executed, the LUs point to that existing stripe group; if there was none and a stripe-consistent stripe group was generated by repair, the LUs point to the stripe group generated by the repair.

Optionally, the mapping table recorded in the controller, which maps each fingerprint to the storage address of the data block it represents, is also updated: the stored address of the data block is updated to point to the stripe-consistent stripe group. At the next deduplication, this correspondence can be used to find the stripe group in which the data block is located and to re-check the stripe state of that stripe group.

Step 35: Optionally, the reference count of the data block may also be updated; the reference count increases by the number of deleted data blocks. The controller records the reference count of each data block, which describes the number of LUs that point to that data block. When the reference count is 0, no LU needs the data block and it can be deleted. The order of steps 34 and 35 is not limited: either may be performed first, or both may be performed at the same time.
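The reference-count bookkeeping of step 35 might look like the sketch below. The `ref_counts` dictionary keyed by block address and both function names are assumptions for illustration; a block that has no recorded count yet is assumed to have exactly one reference.

```python
def update_reference_count(ref_counts: dict, kept, deleted: list) -> None:
    """Increase the kept block's count by the number of deleted duplicates."""
    ref_counts[kept] = ref_counts.get(kept, 1) + len(deleted)
    for address in deleted:
        ref_counts.pop(address, None)


def release(ref_counts: dict, address) -> bool:
    """Drop one reference; return True when the block may be deleted."""
    ref_counts[address] -= 1
    return ref_counts[address] == 0
```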

FIG. 3 is a flowchart of a data block processing method that can perform online deduplication. The method can be performed by a controller, specifically by the processor of the controller executing computer instructions in the cache. The method includes the following steps.

Step 41: Query the fingerprint database of the data blocks in the storage device. When the storage device holds a stored data block with the same fingerprint as the data block to be stored, detect the redundant array of independent disks (RAID) stripe state of the stripe group in which the stored data block is located.

Before storing the data block to be stored in the storage device, the controller first calculates its fingerprint and then checks whether that fingerprint already exists in the fingerprint database. If it does not exist, the data block has not been stored yet and needs to be stored in the storage device. If it exists, the data block has already been stored, and the controller goes on to decide whether the data block to be stored needs to be stored again.

Step 42: Check the RAID stripe state of the stripe group in which the stored data block is located. If the stripe state is not stripe-consistent, a stripe-consistent stripe group needs to be generated. Specifically, the stored data block can be replaced by the data block to be stored, so that the stripe group holding the newly stored block is stripe-consistent; or, if the stripe group of the stored data block can be repaired, that stripe group is repaired. If the stripe state is stripe-consistent, the data block to be stored does not need to be stored and the stored data block does not need to be changed.

The policy is specifically described below, and the embodiment of the present invention may include any one or more of the following strategies. The controller applying the embodiment of the present invention may have a function supporting multiple policies at the same time, or may only support one of the policies.

Policy A. If the stripe state of the stripe group in which the stored data block is located is stripe-consistent, the data block to be stored is not stored. Of course, it is also possible to replace the stored data block with the data block to be stored.

Policy B. If the stripe state of the stripe group is degraded, either the data block to be stored is stored and the stored data block is deleted, or the stored data block is repaired from the degraded stripe group according to a RAID algorithm. In a degraded stripe group, the data in the faulty SUs can be repaired by the RAID algorithm.

Policy C. If the stripe state of the stripe group is inconsistent, the data block to be stored is stored and the stored data block is deleted. Alternatively, when it is determined that no data SU is faulty, that is, when all erroneous SUs are check SUs, the stripe group is repaired. The repair recalculates the data of the check SUs according to the RAID algorithm and then writes the data of the original data SUs and the recalculated check data into a stripe group of the storage device. This can be the stripe group of the original data SUs or a newly allocated stripe group.

To determine whether a data SU is faulty, calculate the fingerprint of the data in the data SUs. If this fingerprint is the same as the fingerprint of the data block to be stored, the data SUs have not been corrupted. The fingerprint is calculated without using the check data.
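The check and the subsequent repair under policy C can be sketched as follows. The example assumes a single stripe whose data block is the concatenation of its data SUs in order, MD5 as the fingerprint, and a single XOR check SU; the function names are hypothetical.

```python
import hashlib
from functools import reduce


def data_sus_intact(data_sus, expected_fingerprint: str) -> bool:
    """Fingerprint the data SUs only (no check data) and compare."""
    return hashlib.md5(b"".join(data_sus)).hexdigest() == expected_fingerprint


def repair_inconsistent(data_sus, expected_fingerprint: str):
    """Return the recomputed check SU, or None if a data SU is corrupted."""
    if not data_sus_intact(data_sus, expected_fingerprint):
        return None  # repair is not possible; store the incoming block instead
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_sus))
```

When the function returns `None`, the remaining option of policy C applies: store the data block to be stored and delete the stored one.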

Step 43: Point the LU that owns the data block to be stored to the stripe-consistent stripe group. After the processing in step 42, the storage device holds a stripe-consistent stripe group, and the data block stored in it is identical to the data block to be stored. Depending on the policy applied in step 42: if the stripe state of the stored data block was stripe-consistent, the LU points to the stripe group in which the stored data block is located; if the stored data block was replaced with the data block to be stored, the LU points to the stripe group in which the newly stored data block is located; if the stripe group of the stored data block was repaired, the LU points to the repaired stripe group.
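Steps 42 and 43 together amount to a small decision table. The sketch below returns an action name for an incoming duplicate block; the action strings, the `prefer_repair` switch and the treatment of a failed stripe group (store the new block) are assumptions made for illustration, since the text leaves the choice between replacing and repairing to the implementation.

```python
def handle_duplicate_write(existing_state: str,
                           data_sus_intact: bool = False,
                           prefer_repair: bool = True) -> str:
    """Decide what to do when the incoming block matches a stored block."""
    if existing_state == "consistent":
        return "reuse_stored"            # policy A: LU points to the stored block
    if existing_state == "degraded" and prefer_repair:
        return "repair_stored"           # policy B: LU points to the repaired group
    if existing_state == "inconsistent" and data_sus_intact and prefer_repair:
        return "repair_stored"           # policy C, repair branch
    return "store_new_delete_old"        # LU points to the newly written group
```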

Optionally, the mapping table recorded in the controller, which maps each fingerprint to the storage address of the data block that the fingerprint represents, is also updated: the storage address of the data block is updated to point to the stripe-consistent stripe group. In the next deduplication, this correspondence can be used to find the stripe group in which the data block is located and to confirm the stripe state again.

Step 44: Optionally, update the reference count of the data block by increasing the reference count of the stored data block by 1.
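The online flow of steps 42 to 44 can be sketched with a small in-memory model. This is an illustration under stated assumptions, not the controller's actual implementation: the `Controller` class, its field names, and the state strings are hypothetical; only the "store the new block and delete the stored one" option is modelled (RAID repair is left out); and redirecting LUs that already pointed at the replaced stripe group is an assumption added to keep the model coherent.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical in-memory model of the controller's bookkeeping."""
    fingerprint_to_group: dict = field(default_factory=dict)  # fingerprint -> stripe group id
    group_state: dict = field(default_factory=dict)           # stripe group id -> state string
    group_data: dict = field(default_factory=dict)            # stripe group id -> block content
    ref_count: dict = field(default_factory=dict)             # fingerprint -> reference count
    lu_map: dict = field(default_factory=dict)                # LU id -> stripe group id
    next_group: int = 0

    def _store(self, block: bytes) -> int:
        """Write the block into a fresh stripe group, assumed stripe-consistent."""
        gid = self.next_group
        self.next_group += 1
        self.group_data[gid] = block
        self.group_state[gid] = "consistent"
        return gid

    def write(self, lu: str, block: bytes) -> int:
        fp = hashlib.md5(block).hexdigest()
        old = self.fingerprint_to_group.get(fp)
        if old is not None and self.group_state[old] == "consistent":
            gid = old  # stored copy is reliable: do not store the block again
        else:
            gid = self._store(block)  # store the new block
            if old is not None:
                # Delete the unreliable stored copy and redirect its LUs (assumption).
                del self.group_data[old], self.group_state[old]
                for other, group in self.lu_map.items():
                    if group == old:
                        self.lu_map[other] = gid
            self.fingerprint_to_group[fp] = gid  # update fingerprint -> address mapping
        self.lu_map[lu] = gid                                # step 43
        self.ref_count[fp] = self.ref_count.get(fp, 0) + 1   # step 44
        return gid
```

Writing the same content from two LUs yields one stripe group with a reference count of 2; if that stripe group is later marked degraded, the next write of the same content stores a fresh copy and all LUs are pointed to it.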

FIG. 4 is a schematic diagram of a data block processing device. The data block processing device 5 includes a fingerprint matching module 51, an address lookup module 52, a consistency check module 53, and an index module 54. A counting module 55 can also be included.

The fingerprint matching module 51 is configured to compare fingerprints of data blocks in the storage system.

Each data block has a fingerprint. For example, a value can be calculated from the data content of the data block by using an algorithm such as MD5 or SHA 128, and the calculated result is used as the fingerprint of the data block. Data blocks with the same fingerprint are regarded as the same data block. The fingerprints of the data blocks in the storage device can be stored in the fingerprint library of the controller.
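As an illustration of fingerprint matching, the sketch below builds a small fingerprint library and reports the fingerprints shared by more than one data block. MD5 is used because the text names it; the function name `find_duplicates` and the representation of the library as a dictionary from fingerprint to block addresses are assumptions.

```python
import hashlib
from collections import defaultdict


def find_duplicates(blocks: dict[str, bytes]) -> dict[str, list[str]]:
    """Group block addresses by MD5 fingerprint; keep groups with more than one block."""
    library: dict[str, list[str]] = defaultdict(list)
    for address, content in blocks.items():
        library[hashlib.md5(content).hexdigest()].append(address)
    return {fp: addrs for fp, addrs in library.items() if len(addrs) > 1}
```

Each returned entry lists the addresses of data blocks that are candidates for deduplication.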

The address lookup module 52 is configured to: when a plurality of data blocks have the same fingerprint, query the addresses of the data blocks according to the fingerprint, and look up the stripe groups occupied by the data blocks according to the data block addresses.

When multiple data blocks with the same fingerprint are found, one of the data blocks can be retained and the remaining data blocks can be deleted, reducing the storage space occupied in the storage device. The controller stores a mapping table that records each fingerprint together with the storage address of the data block that the fingerprint represents. The controller can find the stripe group that stores a data block based on the address of the data block. Data blocks are stored in stripe groups, and data blocks and stripe groups are in one-to-one correspondence.

The consistency check module 53 is configured to check the redundant array of independent inexpensive disks (RAID) stripe state of the stripe group of each data block and, according to the check result, save the data block in a stripe-consistent stripe group.

The consistency check module 53 can also be used to repair a stripe group. Saving the data block in a stripe-consistent stripe group according to the check result includes at least one of the following: if a stripe-consistent stripe group already exists, the data block in the stripe-consistent stripe group is kept unchanged and the data blocks in the remaining stripe groups are deleted; if no stripe-consistent stripe group exists and a degraded stripe group exists, the degraded stripe group is repaired by the RAID algorithm to generate a stripe-consistent stripe group, and the data blocks in the remaining stripe groups are deleted. The stripe state is checked for each stripe group. The state of a stripe group is the state of the least reliable stripe in that stripe group.
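The two retention rules and the "least reliable stripe" rule can be sketched as follows. The enum `StripeState`, the helper names, and in particular the relative ranking of the inconsistent and degraded states are assumptions made for illustration; the text only states that a stripe group takes the state of its least reliable stripe.

```python
from enum import IntEnum
from typing import Optional


class StripeState(IntEnum):
    # Higher value = less reliable. The relative order of INCONSISTENT and
    # DEGRADED is an assumption of this sketch.
    CONSISTENT = 0
    INCONSISTENT = 1
    DEGRADED = 2
    FAILED = 3


def group_state(stripe_states: list[StripeState]) -> StripeState:
    """A stripe group takes the state of its least reliable stripe."""
    return max(stripe_states)


def choose_group(groups: dict[str, StripeState]) -> Optional[tuple[str, bool]]:
    """Return (group to keep, needs_repair), or None if neither rule applies."""
    # Rule 1: prefer an existing stripe-consistent group.
    # Rule 2: otherwise repair a degraded group and keep it.
    for wanted, needs_repair in ((StripeState.CONSISTENT, False),
                                 (StripeState.DEGRADED, True)):
        for name, state in groups.items():
            if state == wanted:
                return name, needs_repair
    return None
```

The data blocks in all stripe groups other than the returned one would then be deleted.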

In the existing deduplication technology, no stripe check is performed. If both a stripe-consistent stripe group and a non-stripe-consistent stripe group exist, the data block in one of the stripe groups is retained at random. If the data block in a failed stripe group is retained and the other stripe groups are cleared, the content of the data block cannot be read normally, that is, the deduplication operation causes data loss. Non-stripe-consistent stripe groups are those in any state other than stripe-consistent, such as stripe-inconsistent, stripe-degraded, or stripe-failed. Statistically, over many deduplication operations, the prior art may cause data loss or reduced data security, whereas the embodiment of the present invention can improve data reliability compared with the prior art.

In the existing deduplication technology, since no stripe check is performed on the stripe group, the reliability of the stripe group cannot be known, and the deduplicated data block may be stored in a degraded stripe group or in another stripe group of low stability. In this embodiment of the present invention, the degraded stripe group may be repaired in such a case, so that the finally retained data block is stored in a stripe-consistent stripe group. This increases data security.

The measures for retaining stripes include: if stripe-consistent stripes exist, the data is deduplicated onto the stripe-consistent stripes; if a degraded stripe exists and there is no stripe-consistent stripe storing the same data as the degraded stripe, the RAID algorithm is used to repair the degraded stripe; if an inconsistent stripe exists, the data of the inconsistent stripe has no error, and there is no stripe-consistent stripe or degraded stripe storing the same data as the inconsistent stripe, the RAID algorithm is used to repair the stripe.

The index module 54 is connected to the consistency check module 53 and is configured to redirect the LUs that point to the stripe groups in which the data blocks are located to the stripe-consistent stripe group.

The LU is managed by the controller and is provided for use by the host. The controller records the stripe group that the LUN points to, and the data blocks in the stripe group form the data of the LUN. When the host reads data, the data block stored in the LUN can be found through the LUN and the stripe group. In the process of deduplication, some data blocks are deleted, and the retained data block is shared by these LUNs. Therefore, the LUs that point to the stripe groups of the deleted data blocks need to be redirected to the stripe-consistent stripe group.

If a stripe-consistent stripe group already exists in the storage device, the index module 54 points these LUs to that existing stripe group; if no stripe-consistent stripe group exists and one is generated by repair, the index module 54 points these LUs to the stripe-consistent stripe group generated by the repair.

Optionally, the index module 54 further updates the mapping table recorded in the controller, which maps each fingerprint to the storage address of the data block that the fingerprint represents: the storage address of the data block is updated to point to the stripe-consistent stripe group. In the next deduplication, this correspondence can be used to find the stripe group in which the data block is located and to confirm the stripe state of the stripe group again.

The counting module 55 is connected to the consistency check module 53 and is configured to update the reference count of the data block; the reference count is increased by the number of deleted data blocks. The controller records the number of times the data block is referenced, and the reference count describes the number of LUs that point to this data block. When the reference count is 0, no LU needs to use this data block, and this data block can be deleted.
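The reference counting described above can be sketched as follows. The class `RefCounter` and its method names are hypothetical; the sketch only shows the arithmetic (the count grows by the number of deleted duplicates, and a block becomes deletable when its count reaches 0).

```python
class RefCounter:
    """Hypothetical reference counter keyed by data block identifier."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def add_after_dedup(self, block_id: str, deleted_blocks: int) -> int:
        """Increase the retained block's count by the number of deleted duplicates."""
        self.counts[block_id] = self.counts.get(block_id, 0) + deleted_blocks
        return self.counts[block_id]

    def release(self, block_id: str) -> bool:
        """One LU stops using the block; return True when the block can be deleted."""
        self.counts[block_id] -= 1
        if self.counts[block_id] == 0:
            del self.counts[block_id]
            return True
        return False
```

For example, a retained block that starts with one reference and absorbs two deleted duplicates ends with a count of 3, and it becomes deletable only after three releases.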

FIG. 5 is a schematic diagram of a data block processing device that can perform online deduplication. The data block processing device 6 includes a query module 61, a consistency check module 62, and an index module 63. A counting module 64 can also be included.

The query module 61 is configured to query the fingerprint database of the data blocks in the storage device and determine whether the storage device holds a stored data block that has the same fingerprint as the data block to be stored. Before storing the data block to be stored in the storage device, the controller first calculates the fingerprint of the data block to be stored, and then checks whether this fingerprint already exists in the fingerprint database. If it does not exist, the data block to be stored has not been stored and needs to be stored in the storage device. If it exists, the data block has already been stored, and it is further determined whether the data block to be stored needs to be stored again.

The consistency check module 62 is configured to detect the RAID stripe state of the stripe group in which the stored data block is located. If the stripe state is not consistent, a stripe-consistent stripe group needs to be generated. Specifically, the stored data block can be replaced with the data block to be stored, so that the stripe group in which the data block to be stored is stored is a stripe-consistent stripe group; alternatively, if the stripe group in which the stored data block is located can be repaired, that stripe group is repaired. If the stripe state is consistent, there is no need to store the data block to be stored, and the stored data block does not need to be changed.

The consistency check module 62 also has the function of repairing a stripe group or a stripe. The processing policies of the consistency check module 62 are described below. The embodiment of the present invention may include any one or more of the following policies. The controller of the embodiment of the present invention may support multiple policies at the same time, or may support only one of them.

Policy C: if the stripe state of the stripe group is inconsistent, the data block to be stored is stored and the stored data block is deleted. Alternatively, when it is determined that no data SU is faulty, that is, when all faulty SUs are verification (parity) SUs, the stripe group is repaired. The specific repair method is to recalculate the data in the verification SU according to the RAID algorithm, and then write the data in the original data SUs and the recalculated data of the verification SU into a stripe group of the storage device. This may be the stripe group in which the original data SUs are located, or a newly allocated stripe group.

The index module 63 is configured to point the LUN to which the data block to be stored belongs to the stripe-consistent stripe group.

After the processing by the consistency check module 62, the storage device holds a stripe-consistent stripe group, and the data block stored in that stripe group is the same as the data block to be stored. Depending on the policy applied in step 42: if the stripe state of the stripe group of the stored data block is consistent, the LU points to the stripe group in which the stored data block is located; if the stored data block is replaced with the data block to be stored, the LU points to the stripe group in which the data block to be stored resides after being written to the storage device; if the stripe group of the stored data block is repaired, the LU points to the repaired stripe group.

Optionally, the index module 63 is further configured to update the mapping table recorded in the controller, which maps each fingerprint to the storage address of the data block that the fingerprint represents: the storage address of the data block is updated to point to the stripe-consistent stripe group. In the next deduplication, this correspondence can be used to find the stripe group in which the data block is located and to confirm the stripe state again.

The counting module 64 is configured to update the reference count of the data block by increasing the reference count of the stored data block by 1. The counting module 64 is optional.

The embodiment of the present invention may further provide a data block processing system, including the data block processing device 5 and a storage device, where the storage device is configured to store data blocks. The data block processing device can be a controller, or software or hardware integrated in the controller.

The embodiment of the present invention may further provide a data block processing system, including the data block processing device 6 and a storage device, where the storage device is configured to store data blocks. The data block processing device can be a controller, or software or hardware integrated in the controller.

Those of ordinary skill in the art will appreciate that various aspects of the present invention, or possible implementations of various aspects, may be embodied as a system, method, or computer program product. Thus, aspects of the invention, or possible implementations of various aspects, may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, etc.), or an embodiment combining software and hardware aspects, which are collectively referred to herein as "circuits," "modules," or "systems." Furthermore, aspects of the invention, or possible implementations of various aspects, may take the form of a computer program product, which is computer readable program code stored on a computer readable medium.

The computer readable medium can be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium includes, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing, such as random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, and portable compact disc read-only memory (CD-ROM).

The processor in the computer reads the computer readable program code stored in the computer readable medium, such that the processor can perform the functional actions specified in each step, or combination of steps, in the flowchart, and generate a device that implements the functional actions specified in each block, or combination of blocks, of the block diagram.

Claims

1. A data processing method used in a controller, characterized in that the method includes: when multiple data blocks have the same fingerprint, querying the addresses of the data blocks based on the fingerprint, and looking up the stripe groups occupied by the data blocks based on the data block addresses;

verifying the redundant array of independent inexpensive disks (RAID) stripe status of the stripe groups of the data blocks, and saving the data block in a stripe-consistent stripe group according to the verification result;

wherein saving the data block in the stripe-consistent stripe group according to the verification result includes at least one of the following: if a stripe-consistent stripe group exists and a non-stripe-consistent stripe group exists, retaining the data block in the stripe-consistent stripe group and deleting the data blocks in the remaining stripe groups;

if no stripe-consistent stripe group exists and a degraded stripe group exists, repairing the degraded stripe group into a stripe-consistent stripe group by the RAID algorithm, saving the data block in the stripe-consistent stripe group, and deleting the data blocks in the remaining stripe groups.

2. The data processing method according to claim 1, characterized in that, storing the data blocks in the stripe groups with consistent stripes according to the verification results further includes:

If all the stripe groups are invalid stripe groups, then the data in the consistent stripes in the invalid stripe group are deduplicated, and the data in the invalid stripes in the invalid stripe group are deleted.

3. The data processing method according to claim 1, characterized in that the method further includes:

pointing the LU that points to the stripe group in which the data block is located to the stripe-consistent stripe group.

4. The data processing method according to claim 1, characterized in that the method further includes:

The number of references of the data block is updated, and the updated number of references is equal to the original count value plus the number of deleted data blocks.

5. A data block processing device, characterized in that the device includes:

The fingerprint comparison module 51 is used to compare the fingerprints of data blocks in the storage device;

Address search module 52, used to, when there are multiple data blocks in the storage device with the same fingerprint, query the addresses of the plurality of data blocks according to the fingerprint, and look up the stripe groups occupied by the plurality of data blocks according to the addresses of the plurality of data blocks;

Consistency check module 53, used to verify the redundant array of independent inexpensive disks (RAID) stripe status of the stripe groups of the multiple data blocks, and save the data block in a stripe-consistent stripe group according to the verification result; wherein saving the data block in the stripe-consistent stripe group according to the verification result includes at least one of the following: if a stripe-consistent stripe group exists and a non-stripe-consistent stripe group exists, retaining the data block in the stripe-consistent stripe group and deleting the data blocks in the remaining stripe groups;

if no stripe-consistent stripe group exists and a degraded stripe group exists, repairing the degraded stripe group into a stripe-consistent stripe group by the RAID algorithm, saving the data block in the stripe-consistent stripe group, and deleting the data blocks in the remaining stripe groups.

6. The data processing device according to claim 5, characterized in that: saving the data blocks in the stripe groups with consistent stripes according to the verification results further includes:

7. The data processing device according to claim 5, characterized in that the device further includes:

The index module 54 is configured to point the LU pointing to the stripe group where the data block is located to the stripe group with the same stripe.

8. The data processing device according to claim 5, characterized in that the device further includes:

Counting module 55 is used to update the count value of the data block. The updated count value is equal to the original count value plus the number of deleted data blocks.

9. A storage system, including the data processing device of any one of claims 6-8, and a storage device.

10. A data block processing method used in a controller, characterized in that the method includes: querying the fingerprint library of data blocks in a storage device, and, when a stored data block having the same fingerprint as the data block to be stored exists in the storage device, detecting the redundant array of independent inexpensive disks (RAID) stripe status of the stripe group where the stored data block is located;

performing data block storage based on the detection, including:

if the stripe status of the stripe group is stripe-consistent, not storing the data block to be stored; if the stripe status of the stripe group is stripe-degraded, then: storing the data block to be stored and deleting the stored data block; or repairing the stored data block from the degraded stripe group according to the RAID algorithm; or

if the stripe status of the stripe group is stripe-inconsistent, then: storing the data block to be stored and deleting the stored data block; or, if no data error has occurred in the data stripe units, repairing the stored data block according to the RAID algorithm.

11. The data block processing method according to claim 10, characterized in that, the method further includes:

Point the LU to which the data block to be stored belongs to the data block stored based on the detection.

12. The data block processing method according to claim 10, characterized in that the method further includes:

Increase the reference count of the stored data block by 1.

13. A data block processing device, characterized in that the device includes:

Query module 61, used to query the fingerprint database of data blocks in the storage device;

The consistency check module 62 is used to detect the redundant array of independent inexpensive disks (RAID) stripe status of the stripe group where the stored data block is located when a stored data block with the same fingerprint as the data block to be stored exists in the storage device;

and to perform data block storage based on the detection, including:

if the stripe status of the stripe group is stripe-inconsistent, then: storing the data block to be stored and deleting the stored data block; or, if no data error has occurred in the data stripe units, repairing the stored data block according to the RAID algorithm.

14. The data block processing device according to claim 13, characterized in that the device further includes:

The index module 63 is used to point the LU to which the data block to be stored belongs to the data block stored according to the detection.

15. The data block processing device according to claim 13, characterized in that the device further includes:

The counting module 64 is used to add 1 to the number of references to the stored data block.

16. A storage system, including the data block processing device of any one of claims 13-15, and a storage device.

17. A controller connected to a storage device. The controller includes a processor and a cache. The cache is used to store computer instructions. The controller performs the following steps by running the computer instructions:

When there are multiple data blocks with the same fingerprint, query the address of the data block based on the fingerprint, and search the stripe group occupied by the data block based on the data block address;

18. The controller according to claim 17, wherein said saving the data blocks in the stripe groups with consistent stripes according to the verification results further includes: If all the stripe groups are invalid stripe groups, then the data in the consistent stripes in the invalid stripe group are deduplicated, and the data in the invalid stripes in the invalid stripe group are deleted.

19. The controller according to claim 17, wherein the processor is further configured to: point the LU pointing to the stripe group where the data block is located to the stripe group with the same stripe.

20. The controller according to claim 17, wherein the processor is further configured to: update the number of references of the data block, where the updated number of references is equal to the original count value plus the number of deleted data blocks.

21. A controller connected to a storage device. The controller includes a processor and a cache. The cache is used to store computer instructions. The controller performs the following steps by running the computer instructions: querying the fingerprint database of data blocks in the storage device, and, when a stored data block with the same fingerprint as the data block to be stored exists in the storage device, detecting the redundant array of independent inexpensive disks (RAID) stripe status of the stripe group where the stored data block is located;

Data block storage based on detection, including:

22. The controller according to claim 21, wherein the processor is further configured to: point the LU to which the data block to be stored belongs to the data block stored according to the detection.

23. The controller according to claim 21, wherein the processor is further configured to: increase the number of references of the stored data block by 1.