CN108170375B

CN108170375B - Overrun protection method and device in distributed storage system

Info

Publication number: CN108170375B
Application number: CN201711389565.3A
Authority: CN
Inventors: 曹锡韬
Original assignee: Innovation Technology Co ltd
Current assignee: Innovation Technology Co ltd
Priority date: 2017-12-21
Filing date: 2017-12-21
Publication date: 2020-12-18
Anticipated expiration: 2037-12-21
Also published as: CN108170375A

Abstract

The invention provides an overrun protection method and device in a distributed storage system. The method comprises: when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold m of the distributed storage system, selecting m of the failed disks as ordinary offline disks and treating the remaining failed disks as overrun disks; if none of the overrun disks is permanently damaged, suspending upper-layer read/write, saving the cache data of all current stripes, and entering an overrun waiting process; during the overrun waiting process, for each overrun disk that has recovered from its fault, writing to that disk the portion of the cache data of all stripes that needs to be written to it, and performing overrun recovery on that disk; and, once all overrun disks have undergone overrun recovery, reactivating upper-layer read/write and ending the overrun waiting process.

Description

Overrun protection method and device in distributed storage system

Technical Field

The present invention relates to the field of distributed storage technologies, and in particular, to an overrun protection method and apparatus in a distributed storage system.

Background

A distributed storage system typically encodes user data with a redundancy algorithm into data sets that carry redundancy, and stores them scattered across multiple physical media on multiple nodes. Common redundancy algorithms include RAID5, RAID6 and RAID7. Each of them stripes the user data and adds check (parity) data blocks to every stripe, such that the user data blocks and the check data blocks satisfy a fixed mathematical relationship. If part of the user data becomes unavailable, its content can still be computed from the remaining data, which improves data reliability.
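As an illustration of the mathematical relationship between user data blocks and check data blocks, the following minimal Python sketch uses byte-wise XOR parity, the relationship commonly used for single-parity schemes such as RAID5. The block values and the helper name xor_blocks are assumptions made for this example and are not taken from the invention.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three user data blocks of one stripe (illustrative values).
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]

# The check block is chosen so that the XOR of all blocks is zero.
parity = xor_blocks(data)

# If data[1] becomes unavailable, it can be recomputed from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
```

With a single check block, only one missing block can be recomputed this way; losing a second block of the same stripe leaves the relationship underdetermined.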

However, every redundancy algorithm has a limit to the protection it offers: RAID5 tolerates one failed disk, RAID6 tolerates two, and so on. If the failures exceed the redundancy capability of the algorithm, there is not enough data left to carry out the calculation, and the data cannot be recovered. For a read operation, reading data that resides on a failed disk fails. For a write operation, the stripe data cannot be updated normally; if the write is forced, the check data no longer matches the user data in the stripe, the stripe is left in an inconsistent state, and a later read may return erroneous data. A distributed storage system therefore generally cannot continue to provide service once the failures exceed the redundancy capability of its redundancy algorithm.

Disclosure of Invention

In view of this, the present invention provides an overrun protection method and apparatus in a distributed storage system, which can enhance the reliability of the distributed storage system.

In order to achieve the purpose, the invention provides the following technical scheme:

an overrun protection method in a distributed storage system, comprising:

when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold m of the distributed storage system, selecting m of the failed disks as ordinary offline disks, and treating the remaining failed disks as overrun disks;

if none of the overrun disks is permanently damaged, suspending upper-layer read/write, saving the cache data of all current stripes, and entering an overrun waiting process;

in the overrun waiting process, for each overrun disk that has recovered from its fault, writing to that overrun disk the cache data, among the cache data of all stripes, that needs to be written to it, and performing overrun recovery on that overrun disk;

and if all the overrun disks have undergone overrun recovery, reactivating the upper-layer read/write and ending the overrun waiting process.

An overrun protection device in a distributed storage system, comprising:

a dividing unit, configured to, when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold m of the distributed storage system, select m of the failed disks as ordinary offline disks and treat the remaining failed disks as overrun disks;

a judging unit, configured to judge whether an overrun disk is a permanently damaged disk;

a suspending unit, configured to suspend upper-layer read/write, save the cache data of all current stripes, and enter an overrun waiting process if the judging unit judges that none of the overrun disks is permanently damaged;

a processing unit, configured to, in the overrun waiting process, for each overrun disk that has recovered from its fault, write to that overrun disk the cache data, among the cache data of all stripes, that needs to be written to it, and perform overrun recovery on that overrun disk;

and a recovery unit, configured to reactivate the upper-layer read/write and end the overrun waiting process if all the overrun disks have undergone overrun recovery.

According to the above technical solution, when an overrun condition occurs in the distributed storage system, the failed disks are divided into ordinary offline disks and overrun disks. Provided that no overrun disk is permanently damaged, upper-layer read/write operations are suspended and an overrun waiting process is entered, during which fault recovery processing is performed on the overrun disks so that the stripe data stays consistent. Once the overrun condition is resolved, the distributed storage system can continue to operate normally, which effectively enhances its reliability.

Drawings

FIG. 1 is a flowchart of an overrun protection method in a distributed storage system according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an overrun protection device in a distributed storage system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described in detail below with reference to the accompanying drawings according to embodiments.

In a distributed storage system, the following three failure scenarios are commonly encountered:

1. physical disk faults, such as circuit damage, head damage, firmware damage and bad sectors;

2. several member disks being pulled out at the same time by human error;

3. a failure of the interconnection network that takes the disks on several nodes offline at the same time.

Of these three scenarios, scenario 1 is highly random: the disk is permanently damaged, repair is time-consuming, and it takes a long time before the disk comes back online. In scenarios 2 and 3 the fault is cleared quickly, and the disks come back online after a short interval.

If the unavailability caused by too many failed disks can be avoided when scenarios 2 and 3 occur, the reliability of the distributed storage system is greatly enhanced. The invention aims to reduce the harm of the overrun condition in scenarios 2 and 3, and to preserve the data integrity of the distributed storage system and its ability to resume service quickly.

The following describes the implementation principle of the present application in detail:

In the present invention, "overrun" means exceeding the limit of the redundancy capability of the distributed storage system, that is, the number of failed disks exceeds the redundancy capability threshold of the distributed storage system. For example, RAID5 tolerates one failed disk, so a distributed storage system using RAID5 has a redundancy capability threshold of 1, and an overrun condition occurs if more than 1 disk fails. Likewise, RAID6 tolerates two failed disks, so a distributed storage system using RAID6 has a redundancy capability threshold of 2, and an overrun condition occurs if more than 2 disks fail.
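The overrun definition can be expressed as a simple comparison. The following Python sketch is illustrative only; the table name REDUNDANCY_THRESHOLD and the function name is_overrun are assumptions for the example, and the threshold values are the ones stated in the text.

```python
# Redundancy capability threshold m per redundancy algorithm, as stated above.
REDUNDANCY_THRESHOLD = {"RAID5": 1, "RAID6": 2}


def is_overrun(scheme: str, failed_disks: int) -> bool:
    """Return True when the failed-disk count exceeds the threshold m."""
    return failed_disks > REDUNDANCY_THRESHOLD[scheme]
```

For instance, is_overrun("RAID5", 1) is false because one failed disk is still within the redundancy capability, while is_overrun("RAID5", 2) is true.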

Referring to fig. 1, fig. 1 is a flowchart of an overrun protection method in a distributed storage system according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:

Step 101, when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold m of the distributed storage system, selecting m of the failed disks as ordinary offline disks, and treating the remaining failed disks as overrun disks.

The method is suitable for the Linux system. Specifically, it is applied in the Linux kernel as a kernel driver (for ease of distinction, this kernel driver is hereinafter called the overrun kernel driver).

The overrun kernel driver sits between the disks and the user program and stripes data in memory. Each stripe comprises as many memory buffers as there are disks in the distributed storage system, and the memory buffers of a stripe correspond one-to-one to those disks.
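The one-to-one correspondence between a stripe's memory buffers and the disks can be pictured with a small data structure. This is a user-space Python sketch for illustration, not the kernel driver itself; the class name Stripe and its fields are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Stripe:
    """One in-memory stripe with exactly one buffer per disk (hypothetical model)."""

    disk_ids: list
    buffers: dict = field(init=False)

    def __post_init__(self):
        # One memory buffer for each disk, keyed by the disk it maps to.
        self.buffers = {disk_id: bytearray() for disk_id in self.disk_ids}


stripe = Stripe(["disk0", "disk1", "disk2"])
```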

After receiving an upper-layer read/write request (that is, a read/write request from a user program), the overrun kernel driver binds the request to a stripe, and the stripe carries out the read/write operation according to the bound request. An upper-layer read/write request is either an upper-layer read request or an upper-layer write request. An upper-layer read request is complete once the stripe has read the data from the disks. An upper-layer write request is complete once the stripe has written to the disks both the user data held in its memory buffers and the check data calculated from it.

While a stripe executes a read/write operation for an upper-layer read/write request, any disk read/write error (that is, a failed disk read or a failed disk write) is taken to mean that the disk has failed, and the time of the failure is recorded. Note that in the Linux system the kernel driver vpool manages the state of the physical disks and virtual disks in the system. When a stripe reads or writes a disk, it is in effect sending an internal read/write request to a virtual disk controlled by vpool, and when the disk access fails, the return value of that request can indicate the cause of the error, such as physical damage or a dropped connection. Here, "internal read/write request" refers to a read/write request that the overrun kernel driver sends to a virtual disk controlled by vpool.
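The failure detection described above (a read or write error marks the disk as failed and the failure time is recorded) can be sketched as follows. This is a hedged Python illustration; the class FailureLog, its method do_io and the injectable clock are assumptions for the example and do not model the vpool interface.

```python
import time


class FailureLog:
    """Record the first failure time of each disk (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.failed_at = {}

    def do_io(self, disk_id, operation):
        """Run a disk read or write; on error, mark the disk as failed."""
        try:
            return operation()
        except OSError:
            # Keep the earliest failure time if the disk fails repeatedly.
            self.failed_at.setdefault(disk_id, self._clock())
            raise
```

The recorded times are what a time-ordered division of failed disks would later sort on.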

When the number of failed disks exceeds the redundancy capability threshold of the distributed storage system, overrun protection is performed. Specifically, the failed disks are divided into ordinary offline disks and overrun disks. Several division methods are possible. For example, m failed disks can be selected at random as ordinary offline disks, with the rest treated as overrun disks; or the failed disks can be sorted by failure time, with the first m treated as ordinary offline disks and the rest as overrun disks.
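The time-ordered division can be sketched in a few lines of Python. The function name partition_failed_disks and the dictionary of failure times are assumptions made for this illustration.

```python
def partition_failed_disks(failure_times: dict, m: int):
    """Split failed disks into (ordinary_offline, overrun) by failure time.

    failure_times maps a disk id to the time its failure was recorded.
    The m earliest failures become ordinary offline disks.
    """
    ordered = sorted(failure_times, key=failure_times.get)
    return ordered[:m], ordered[m:]


offline, overrun = partition_failed_disks({"a": 3.0, "b": 1.0, "c": 2.0}, m=1)
```

Here disk "b" failed first and becomes the ordinary offline disk, while "c" and "a" become overrun disks.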

Step 102, judging whether any of the overrun disks is permanently damaged; if not, executing step 103; otherwise, executing step 106.

In the embodiment of the invention, the physically damaged disk is determined as the permanently damaged disk, and the non-physically damaged disk is determined as the disk which is not permanently damaged.
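The judgment in step 102 can be sketched as below. The enumeration FailureCause and both function names are hypothetical; the only rule taken from the text is that physical damage counts as permanent damage and any other cause does not.

```python
from enum import Enum


class FailureCause(Enum):
    PHYSICAL_DAMAGE = "physical_damage"
    DISCONNECTED = "disconnected"


def is_permanently_damaged(cause: FailureCause) -> bool:
    """Physical damage is permanent; any other cause is not."""
    return cause is FailureCause.PHYSICAL_DAMAGE


def can_enter_overrun_wait(overrun_causes: dict) -> bool:
    """True when no overrun disk is permanently damaged (go to step 103)."""
    return not any(is_permanently_damaged(c) for c in overrun_causes.values())
```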

Step 103, suspending upper-layer read/write operations, saving the cache data of all current stripes, and entering the overrun waiting process.

An upper-layer read/write operation is a read/write request from an upper-layer application (that is, a user program).

The cache data of a stripe is the data held in memory that relates to the upper-layer read/write requests bound to that stripe, for example the data to be written that an upper-layer write request carries.

Step 104, in the overrun waiting process, for each overrun disk that has recovered from its fault, writing to that overrun disk the cache data, among the cache data of all stripes, that needs to be written to it, and performing overrun recovery on that overrun disk.

In the Linux system, disk online and offline events are monitored and controlled by an independent module, disknet, which notifies the kernel driver of legitimate online events. In this embodiment, when disknet detects a legitimate online event for any overrun disk, it sends an online notification for that disk to the overrun kernel driver, and the overrun kernel driver determines from the received notification that the overrun disk has recovered from its fault.

After determining that a given overrun disk has recovered from its fault, the overrun kernel driver may immediately write to that disk the cache data, among the cache data of all stripes, that needs to be written to it. Alternatively, it may wait until all overrun disks have recovered from their faults and then, for each overrun disk, write to it the cache data that needs to be written to it.
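The two write-back timings (immediately per disk, or only after all overrun disks are back) can be contrasted with a short Python sketch. The function name disks_ready_to_flush and the deferred flag are assumptions for illustration.

```python
def disks_ready_to_flush(overrun: set, recovered: set, deferred: bool) -> list:
    """Return the overrun disks whose cached data may be written back now.

    deferred=False: flush each overrun disk as soon as it has recovered.
    deferred=True:  flush nothing until every overrun disk has recovered.
    """
    if deferred:
        return sorted(overrun) if overrun <= recovered else []
    return sorted(overrun & recovered)
```

With overrun disks {"b", "c"} and only "b" back online, the immediate policy returns ["b"] and the deferred policy returns an empty list.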

In practice, an overrun disk that has recovered from its fault may fail again while the cache data of the stripes is being written to it. In that case the nature of the new failure is judged. If the disk is permanently damaged, the overrun condition is determined to be unrecoverable. If it is not, the disk continues to be treated as an overrun disk, and the overrun waiting process goes on waiting for it to recover.

Therefore, in the embodiment of the present invention, while the overrun kernel driver is writing to an overrun disk the cache data that needs to be written to it, if the write fails for the cache data of any stripe, the driver further judges whether that overrun disk is permanently damaged. If so, it determines that the overrun condition cannot be resolved and marks the overrun condition of the distributed storage system as unrecoverable (that is, step 106 is executed). If not, the disk continues to be kept as an overrun disk.
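The handling of a repeated failure during write-back can be sketched as follows. This is a simplified Python model, not the kernel implementation; the function write_back, its callback parameters and the three result strings are assumptions introduced for the example.

```python
def write_back(disk_id, pending, write_block, is_permanently_damaged):
    """Write cached stripe data to a recovered overrun disk.

    pending is an iterable of (stripe_id, payload) pairs destined for disk_id.
    Returns "recovered", "still_overrun" or "unrecoverable".
    """
    for stripe_id, payload in pending:
        try:
            write_block(disk_id, stripe_id, payload)
        except OSError:
            # The disk failed again during write-back: judge the failure.
            if is_permanently_damaged(disk_id):
                return "unrecoverable"  # corresponds to step 106
            return "still_overrun"      # keep waiting for this disk
    return "recovered"
```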

Step 105, if all the overrun disks have undergone overrun recovery, reactivating the upper-layer read/write operations and ending the overrun waiting process.

In this step, all overrun disks have recovered from their faults, and the cache data of all stripes that needed to be written to them has been written successfully, so all overrun disks are working normally again. The number of failed disks in the distributed storage system is now within its redundancy capability, and the system can work normally. Upper-layer read/write operations can therefore be reactivated, the overrun waiting process ends, and the distributed storage system resumes normal operation.

Step 106, determining that the overrun condition cannot be resolved, and marking the overrun condition of the distributed storage system as unrecoverable.

In the embodiment of the invention, as long as the permanently damaged overrun disk exists, the overrun condition of the distributed storage system can not be relieved/recovered. In the case that the distributed storage system is not recoverable, an error or alarm message may also be returned to the upper layer application (i.e., the user program).

In the embodiment shown in FIG. 1, when an overrun condition occurs in the distributed storage system, that is, when the number of failed disks exceeds the redundancy capability threshold of the distributed storage system, the failed disks are divided into ordinary offline disks and overrun disks. Provided that no overrun disk is permanently damaged, upper-layer read/write operations are suspended and the overrun waiting process is entered, during which fault recovery processing is performed on the overrun disks so that the stripe data stays consistent. Once the overrun condition is resolved, the distributed storage system can continue to operate normally, which effectively enhances its reliability.
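The flow of steps 101 to 106 can be summarized as a small state machine. The Python sketch below is a user-space model under stated assumptions: the class OverrunGuard, its state names and its method signatures are hypothetical, the division is by failure time, and on_disk_recovered is assumed to be called only after the disk is back online and its cached data has been written back successfully.

```python
class OverrunGuard:
    """Toy model of the overrun protection flow (not the kernel driver)."""

    def __init__(self, m: int):
        self.m = m                # redundancy capability threshold
        self.state = "normal"
        self.waiting_on = set()   # overrun disks not yet recovered

    def on_failures(self, failures):
        """failures: list of (disk_id, failure_time, physically_damaged)."""
        if len(failures) <= self.m:
            return self.state     # within redundancy capability
        # Step 101: the m earliest failures are ordinary offline disks.
        ordered = sorted(failures, key=lambda f: f[1])
        overrun = ordered[self.m:]
        # Step 102: any permanently damaged overrun disk is fatal.
        if any(damaged for _, _, damaged in overrun):
            self.state = "unrecoverable"       # step 106
        else:
            self.waiting_on = {disk for disk, _, _ in overrun}
            self.state = "overrun_waiting"     # step 103: upper-layer I/O suspended
        return self.state

    def on_disk_recovered(self, disk_id):
        """Step 104 finished for disk_id; step 105 once none remain."""
        if self.state != "overrun_waiting":
            return self.state
        self.waiting_on.discard(disk_id)
        if not self.waiting_on:
            self.state = "normal"              # upper-layer I/O reactivated
        return self.state
```

With m = 1 and three disconnected disks, the guard enters "overrun_waiting" and returns to "normal" only after both overrun disks have been recovered.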

The overrun protection method in the distributed storage system of the present invention is described in detail above, and the present invention also provides an overrun protection device in the distributed storage system, which is described in detail below with reference to fig. 2.

Referring to fig. 2, fig. 2 is a schematic structural diagram of an overrun protection apparatus in a distributed storage system according to an embodiment of the present invention, and as shown in fig. 2, the apparatus includes:

the dividing unit 201 is configured to divide a failed disk into a common offline disk and an overrun disk when the number of failed disks in the distributed storage system exceeds a redundancy capability threshold of the distributed storage system;

a judging unit 202, configured to judge whether the overrun disk is a permanently damaged disk;

a suspending unit 203, configured to suspend upper layer read/write, store cache data of all current stripes, and enter an overrun waiting process if the determining unit 202 determines that there is no permanently damaged overrun disk in all overrun disks;

the processing unit 204 is configured to, in the overrun waiting process, write, to each overrun disk that is recovered from the failure, the cache data that needs to be written into the overrun disk in the cache data of all the stripes into the overrun disk, and perform overrun recovery on the overrun disk;

and the recovery unit 205 is configured to activate upper layer read/write and end the overrun waiting process if all the overrun disks are overrun recovered.

The apparatus shown in fig. 2 further comprises a detection unit 206;

the detecting unit 206 is configured to record a time of a disk failure when the failed disk is detected;

the dividing unit 201, when dividing the failed disk into the ordinary offline disk and the overrun disk, is configured to: sequencing according to the time sequence of the disk failures, determining the first m failed disks as ordinary offline disks, and determining other failed disks as overrun disks; wherein m is a redundancy capability threshold of the distributed storage system.

In the device shown in figure 2 of the drawings,

the detecting unit 206, when detecting whether the disk fails, is configured to: when reading data from the disk fails or writing data to the disk fails, the disk failure is determined.

In the apparatus shown in fig. 2,

the suspending unit 203 is further configured to: if the judging unit 202 judges that any of the overrun disks is permanently damaged, determine that the overrun condition cannot be resolved, and mark the overrun condition of the distributed storage system as unrecoverable.

The apparatus shown in fig. 2 further includes a receiving unit 207;

the receiving unit 207 is configured to receive an online notification of each failed disk;

the processing unit 204 is further configured to: if the receiving unit 207 receives an online notification of any overrun disk, it is determined that the overrun disk failure is recovered, otherwise, it is determined that the overrun disk failure is not recovered.

In the device shown in figure 2 of the drawings,

the processing unit 204 is further configured to: when the cache data needing to be written into the overrun disk in the cache data of any stripe fails to be written into the overrun disk, the instruction judging unit 202 judges whether the overrun disk is a permanently damaged disk or not, if so, the overrun condition is determined to be unable to be relieved, the overrun condition of the distributed storage system is marked to be unrecoverable, and if not, the disk is continuously kept as the overrun disk.

In the device shown in figure 2 of the drawings,

the determining unit 202, when determining whether the failed disk is permanently damaged, is configured to: if the failed disk is a failure due to physical damage, then the failed disk is determined to be a permanently damaged disk, otherwise, the failed disk is determined not to be a permanently damaged disk.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. An overrun protection method in a distributed storage system, the method comprising:

if none of the overrun disks is permanently damaged, suspending upper-layer read/write, saving the cache data of all current stripes, and entering an overrun waiting process; the overrun waiting process is a process of waiting for overrun recovery of all the overrun disks;

2. The method of claim 1,

the method further comprises the following steps: when a fault disk is detected, recording the time of the disk fault;

the method for selecting m failed disks from the failed disks as common offline disks and using the rest failed disks as overrun disks comprises the following steps: and sequencing according to the time sequence of the disk failures, determining the first m failed disks as ordinary offline disks, and determining the rest failed disks as overrun disks.

3. The method of claim 2,

the method for detecting whether the disk fails comprises the following steps: when reading data from the disk fails or writing data to the disk fails, the disk failure is determined.

4. The method of claim 1,

the method further comprises the following steps: and if the permanently damaged overrun disks exist in all the overrun disks, determining that the overrun condition cannot be relieved, and marking that the overrun condition of the distributed storage system cannot be recovered.

5. The method of claim 1,

the method for determining that an overrun disk has recovered from its fault comprises: if an online notification for the overrun disk is received, determining that the overrun disk has recovered from its fault; otherwise, determining that the overrun disk has not recovered from its fault.

6. The method of claim 1,

the method further comprises: when writing to an overrun disk the cache data, among the cache data of any stripe, that needs to be written to it fails, judging whether the overrun disk is a permanently damaged disk; if so, determining that the overrun condition cannot be resolved and marking the overrun condition of the distributed storage system as unrecoverable; if not, continuing to keep the disk as an overrun disk.

7. The method of claim 1, 5, or 6,

the method for judging whether the overrun disk is permanently damaged comprises the following steps: if the overrun disk is a failure caused by physical damage, the overrun disk is determined to be a permanently damaged disk, otherwise the overrun disk is determined not to be a permanently damaged disk.

8. An overrun protection apparatus in a distributed storage system, the apparatus comprising:

a suspending unit, configured to suspend upper-layer read/write, save the cache data of all current stripes, and enter an overrun waiting process if the judging unit judges that none of the overrun disks is permanently damaged; the overrun waiting process is a process of waiting for overrun recovery of all the overrun disks;

9. The apparatus of claim 8, further comprising a detection unit;

the detection unit is used for recording the time of the failure of the disk when the failed disk is detected;

the dividing unit is used for selecting m fault disks from the fault disks as common offline disks, and when the rest fault disks are used as overrun disks, the dividing unit is used for: sequencing according to the time sequence of the disk failures, determining the first m failed disks as ordinary offline disks, and determining the rest failed disks as overrun disks; wherein m is a redundancy capability threshold of the distributed storage system.

10. The apparatus of claim 9,

the detection unit, when detecting whether the disk is faulty, is configured to: when reading data from the disk fails or writing data to the disk fails, the disk failure is determined.

11. The apparatus of claim 8,

and the suspending unit is used for determining that the overrun condition cannot be relieved and marking that the overrun condition of the distributed storage system cannot be recovered if the judging unit judges that permanently damaged overrun disks exist in all the overrun disks.

12. The apparatus of claim 8, further comprising a receiving unit;

the receiving unit is used for receiving online notifications of each fault disk;

the processing unit is further configured to: if the receiving unit receives an online notification for any overrun disk, determine that the overrun disk has recovered from its fault; otherwise, determine that the overrun disk has not recovered from its fault.

13. The apparatus of claim 8,

the processing unit is further configured to: when writing to an overrun disk the cache data, among the cache data of any stripe, that needs to be written to it fails, instruct the judging unit to judge whether the overrun disk is a permanently damaged disk; if so, determine that the overrun condition cannot be resolved and mark the overrun condition of the distributed storage system as unrecoverable; if not, continue to keep the disk as an overrun disk.

14. The apparatus of claim 8, 12, or 13,

the judging unit is used for judging whether the overrun disk is permanently damaged or not: if the overrun disk is a failure caused by physical damage, the overrun disk is determined to be a permanently damaged disk, otherwise the overrun disk is determined not to be a permanently damaged disk.