CN108170375B - Overrun protection method and device in distributed storage system - Google Patents

Overrun protection method and device in distributed storage system

Publication number
CN108170375B
Authority
CN
China
Prior art keywords
overrun
disk
disks
fault
storage system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711389565.3A
Other languages
Chinese (zh)
Other versions
CN108170375A (en)
Inventor
曹锡韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innovation Technology Co ltd
Original Assignee
Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innovation Technology Co ltd
Priority to CN201711389565.3A
Publication of CN108170375A
Application granted
Publication of CN108170375B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2094 Redundant storage or storage space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices
    • G06F3/0689 Disk arrays, e.g. RAID, JBOD

Abstract

The invention provides an overrun protection method and device for a distributed storage system. The method comprises the following steps: when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold m of the system, selecting m of the failed disks as ordinary offline disks and treating the remaining failed disks as overrun disks; if none of the overrun disks is permanently damaged, suspending upper-layer read/write, saving the cache data of all current stripes, and entering an overrun waiting process; during the overrun waiting process, for each overrun disk that recovers from its failure, writing to that disk the portion of the stripes' cache data that is destined for it, thereby completing overrun recovery of that disk; and, once all overrun disks have completed overrun recovery, reactivating upper-layer read/write and ending the overrun waiting process.

Description

Overrun protection method and device in distributed storage system
Technical Field
The present invention relates to the field of distributed storage technologies, and in particular, to an overrun protection method and apparatus in a distributed storage system.
Background
A distributed storage system typically encodes user data with a redundancy algorithm into redundant data sets that are scattered across multiple physical media on multiple nodes. Common redundancy algorithms include RAID5, RAID6, and RAID7. Each of them stripes the user data and adds check (parity) data blocks to each stripe, such that the user data blocks and the check data blocks satisfy a fixed mathematical relationship. If part of the user data becomes unavailable, its content can still be computed from the remaining data, which improves data reliability.
However, every redundancy algorithm has a limit to the protection it offers: RAID5 tolerates one failed disk, RAID6 tolerates two, and so on. If the failures exceed the redundancy capability of the algorithm, too little data remains to solve the underlying equations, and the lost data cannot be recovered. For a read operation, any attempt to read data on a failed disk fails. For a write operation, the stripe data cannot be updated normally; if the write is forced, the check data no longer matches the user data in the stripe, the stripe is left in an inconsistent state, and subsequent reads may return erroneous data. Consequently, a distributed storage system generally cannot continue to provide service once the failures exceed the redundancy capability of its redundancy algorithm.
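The protection limit described above can be illustrated with single-parity (RAID5-style) XOR. The following is a minimal sketch under assumptions: the helper `xor_blocks` and the sample byte values are hypothetical and are not taken from the patent, and real RAID implementations operate on disk sectors rather than short byte strings.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three user data blocks of one stripe and their single parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data)

# One lost block (index 1) can be rebuilt from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])

# With two lost blocks (indexes 1 and 2), the survivors only yield the XOR
# of the two missing blocks, which is not enough to separate them.
combined = xor_blocks([data[0], parity])
```

Here `rebuilt` equals the lost block, while `combined` equals only the XOR of the two missing blocks, which matches the statement that one parity block tolerates exactly one failure.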
Disclosure of Invention
In view of this, the present invention provides an overrun protection method and apparatus in a distributed storage system, which can enhance the reliability of the distributed storage system.
To achieve this purpose, the invention provides the following technical solutions:
An overrun protection method in a distributed storage system, comprising:
when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold m of the distributed storage system, selecting m of the failed disks as ordinary offline disks and treating the remaining failed disks as overrun disks;
if none of the overrun disks is permanently damaged, suspending upper-layer read/write, saving the cache data of all current stripes, and entering an overrun waiting process;
during the overrun waiting process, for each overrun disk that recovers from its failure, writing to that overrun disk the cache data, among the cache data of all stripes, that needs to be written to it, and performing overrun recovery on that overrun disk;
and, if all overrun disks have completed overrun recovery, activating upper-layer read/write and ending the overrun waiting process.
An overrun protection device in a distributed storage system, comprising:
a dividing unit, configured to select m of the failed disks as ordinary offline disks and treat the remaining failed disks as overrun disks when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold m of the distributed storage system;
a judging unit, configured to judge whether an overrun disk is permanently damaged;
a suspending unit, configured to suspend upper-layer read/write, save the cache data of all current stripes, and enter an overrun waiting process if the judging unit judges that none of the overrun disks is permanently damaged;
a processing unit, configured to, during the overrun waiting process and for each overrun disk that recovers from its failure, write to that overrun disk the cache data, among the cache data of all stripes, that needs to be written to it, and perform overrun recovery on that overrun disk;
and a recovery unit, configured to activate upper-layer read/write and end the overrun waiting process if all overrun disks have completed overrun recovery.
According to the above technical solutions, when an overrun condition occurs in the distributed storage system, the failed disks are divided into ordinary offline disks and overrun disks. Provided that no overrun disk is permanently damaged, upper-layer read/write is suspended and an overrun waiting process is started, during which the overrun disks are brought through failure recovery so that the stripe data remains consistent. The distributed storage system can therefore resume normal operation once the overrun condition is resolved, which effectively enhances its reliability.
Drawings
FIG. 1 is a flowchart of an overrun protection method in a distributed storage system according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an overrun protection device in a distributed storage system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described in detail below with reference to the accompanying drawings according to embodiments.
In a distributed storage system, the following three failure scenarios are commonly encountered in practice:
1. physical faults of a disk, such as circuit damage, head damage, firmware damage, or bad sectors;
2. several member disks being pulled out at the same time by operator error;
3. a failure of the interconnection network that takes the disks on several nodes offline at the same time.
Of these three scenarios, scenario 1 occurs largely at random: the disk is permanently damaged, repair is time-consuming, and a long time passes before the disk comes back online. In scenarios 2 and 3, failure recovery is quick, and the disks come back online after only a short interval.
If the unavailability caused by too many failed disks can be avoided when scenarios 2 and 3 occur, the reliability of the distributed storage system is greatly enhanced. The invention aims to reduce the harm of the overrun condition in scenarios 2 and 3 and to preserve both the data integrity of the distributed storage system and its ability to restore service quickly.
The implementation principle of the present application is described in detail below.
In the present invention, "overrun" means exceeding the limit of the redundancy capability of the distributed storage system; that is, the number of failed disks exceeds the redundancy capability threshold of the system. For example, RAID5 tolerates 1 failed disk, so a distributed storage system using RAID5 has a redundancy capability threshold of 1; if more than 1 disk in that system fails, an overrun condition occurs. Likewise, RAID6 tolerates 2 failed disks, so a distributed storage system using RAID6 has a redundancy capability threshold of 2; if more than 2 disks fail, the system is in an overrun condition.
Referring to FIG. 1, which is a flowchart of an overrun protection method in a distributed storage system according to an embodiment of the present invention, the method includes the following steps:
Step 101: when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold m of the distributed storage system, select m of the failed disks as ordinary offline disks and treat the remaining failed disks as overrun disks.
The method is suited to Linux systems. Specifically, it is applied in the Linux kernel and implemented as a kernel driver (hereinafter called the overrun kernel driver, to distinguish it from other kernel drivers).
The overrun kernel driver sits between the disks and the user program and stripes data in memory. Each stripe contains as many memory buffers as there are disks in the distributed storage system, and the memory buffers of a stripe correspond one-to-one to the disks.
After receiving an upper-layer read/write request (that is, a read/write request from a user program), the overrun kernel driver binds the request to a stripe, and the stripe carries out the read/write operation for the bound request. An upper-layer read/write request is either an upper-layer read request or an upper-layer write request. A read request is complete once the stripe has read the data from the disks. A write request is complete once the stripe has written to the disks the user data held in its memory buffers together with the calculated check data.
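For illustration only, the stripe structure described above might be modeled as follows. The names `Stripe`, `bind`, and `cached_data_for`, as well as the dictionary-shaped requests, are hypothetical and do not come from the patent; the sketch is user-space Python rather than kernel code, and it omits the check data calculation.

```python
from dataclasses import dataclass, field

@dataclass
class Stripe:
    """One stripe: a memory buffer per disk plus the requests bound to it."""
    disk_count: int
    buffers: list = field(init=False)
    bound_requests: list = field(default_factory=list)

    def __post_init__(self):
        # One buffer slot per disk, in one-to-one correspondence.
        self.buffers = [None] * self.disk_count

    def bind(self, request):
        """Bind an upper-layer request; a write stages its data in a buffer."""
        self.bound_requests.append(request)
        if request["op"] == "write":
            self.buffers[request["disk"]] = request["data"]

    def cached_data_for(self, disk):
        """Return the cached data destined for the given disk, or None."""
        return self.buffers[disk]

stripe = Stripe(disk_count=4)
stripe.bind({"op": "write", "disk": 2, "data": b"user-data"})
stripe.bind({"op": "read", "disk": 0})
```

Keeping the buffers indexed by disk makes it straightforward to look up, per disk, what still has to be written once that disk comes back online.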
While a stripe is executing a read/write operation for an upper-layer request, any disk read/write error (that is, a failed disk read or a failed disk write) is taken to mean that the disk has failed, and the time of the failure can be recorded. Note that in the Linux system, the kernel driver vpool manages the states of the physical disks and virtual disks in the system. When a stripe reads or writes a disk, it in fact sends an internal read/write request to a virtual disk controlled by vpool; when the read or write fails, the return value of the request can indicate the cause of the error, such as physical damage or a dropped connection. Here, "internal read/write request" refers to a read/write request that the overrun kernel driver sends to a virtual disk controlled by vpool.
When the number of failed disks exceeds the redundancy capability threshold of the distributed storage system, overrun protection is performed. Specifically, the failed disks are divided into ordinary offline disks and overrun disks. Several division methods are possible. For example, m disks can be selected at random as ordinary offline disks, with the remaining disks treated as overrun disks; alternatively, the failed disks can be sorted by failure time, with the first m failed disks treated as ordinary offline disks and the remaining failed disks as overrun disks.
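The division by failure time might be sketched as follows. This is an assumption-laden illustration: `FailedDisk`, `partition_failed_disks`, the disk names, and the timestamps are hypothetical and are not part of the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailedDisk:
    name: str
    failed_at: float  # hypothetical timestamp recorded at the read/write error

def partition_failed_disks(failed, m):
    """Split failed disks into (ordinary_offline, overrun) by failure time.

    Returns None when the number of failed disks does not exceed the
    redundancy capability threshold m, i.e. there is no overrun condition.
    """
    if len(failed) <= m:
        return None
    ordered = sorted(failed, key=lambda d: d.failed_at)
    return ordered[:m], ordered[m:]

# RAID6-like system (m = 2) with three failed disks.
failed = [FailedDisk("sdc", 30.0), FailedDisk("sda", 10.0), FailedDisk("sdb", 20.0)]
offline, overrun = partition_failed_disks(failed, m=2)
```

In this example the two earliest failures, `sda` and `sdb`, become ordinary offline disks, and `sdc` becomes the overrun disk.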
Step 102: determine whether any of the overrun disks is permanently damaged; if not, go to step 103; otherwise, go to step 106.
In the embodiment of the invention, a disk that has failed because of physical damage is determined to be permanently damaged, and a disk that has failed for any other reason is determined not to be permanently damaged.
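The decision in step 102 can be sketched as below. The enumeration `FailureCause` and the functions `is_permanently_damaged` and `next_step` are hypothetical names; the patent only states that the return value of a failed internal read/write request can indicate causes such as physical damage or a dropped connection.

```python
from enum import Enum

class FailureCause(Enum):
    PHYSICAL_DAMAGE = "physical damage"
    DISCONNECTED = "dropped connection"

def is_permanently_damaged(cause):
    """Only a failure caused by physical damage counts as permanent."""
    return cause is FailureCause.PHYSICAL_DAMAGE

def next_step(overrun_causes):
    """Return 103 (enter overrun waiting) or 106 (mark as unrecoverable)."""
    if any(is_permanently_damaged(c) for c in overrun_causes.values()):
        return 106
    return 103

step_unplugged = next_step({"sdc": FailureCause.DISCONNECTED})
step_broken = next_step({"sdc": FailureCause.DISCONNECTED,
                         "sdd": FailureCause.PHYSICAL_DAMAGE})
```

A single permanently damaged overrun disk is enough to route the flow to step 106, which matches the "as long as any" wording of the embodiment.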
Step 103: suspend upper-layer read/write operations, save the cache data of all current stripes, and enter the overrun waiting process.
An upper-layer read/write operation is a read/write request from an upper-layer application (that is, a user program).
The cache data of a stripe is the data held in memory that relates to the upper-layer read/write requests bound to that stripe, for example the data to be written that an upper-layer write request carries.
Step 104: during the overrun waiting process, for each overrun disk that recovers from its failure, write to that disk the cache data, among the cache data of all stripes, that needs to be written to it, and perform overrun recovery on that disk.
In the Linux system, disk online and offline events are monitored and controlled by an independent module, disknet, which notifies the kernel driver of valid online events. In the embodiment of the application, when disknet detects a valid online event for any overrun disk, it sends an online notification for that disk to the overrun kernel driver, and the overrun kernel driver determines from the received notification that the overrun disk has recovered from its failure.
Once it determines that an overrun disk has recovered, the overrun kernel driver can immediately write to that disk the cache data, among the cache data of all stripes, that needs to be written to it. Alternatively, it can wait until all overrun disks have recovered and then, for each overrun disk, write the cache data that needs to be written to it.
In practice, after an overrun disk has recovered, it may fail again while the stripes' cache data is being written to it. In that case, the nature of the new failure is assessed: if the disk is permanently damaged, the overrun condition is determined to be unrecoverable; if not, the disk remains an overrun disk and the overrun waiting process continues to wait for it to recover.
Therefore, in the embodiment of the present invention, while the overrun kernel driver is writing to an overrun disk the cache data that needs to be written to it, if the write fails for the cache data of any stripe, the driver further determines whether the overrun disk is permanently damaged. If so, it determines that the overrun condition cannot be resolved and marks the overrun condition of the distributed storage system as unrecoverable (that is, step 106 is executed); if not, the disk remains an overrun disk.
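The write-back logic of step 104, including the repeated-failure case, might look like the following sketch. All names are hypothetical: `flush_cache_to_disk`, the callables `write` and `is_permanently_damaged`, and the per-stripe dictionaries stand in for the internal read/write requests and cause codes, whose actual interfaces the patent does not specify.

```python
def flush_cache_to_disk(disk, stripe_caches, write, is_permanently_damaged):
    """Write every stripe's cached data destined for a recovered overrun disk.

    Returns "recovered", "still_overrun", or "unrecoverable".
    """
    for cache in stripe_caches:
        data = cache.get(disk)
        if data is None:
            continue  # this stripe has nothing cached for the disk
        if not write(disk, data):
            if is_permanently_damaged(disk):
                return "unrecoverable"  # corresponds to step 106
            return "still_overrun"      # keep waiting for the next online event
    return "recovered"

# Each dict maps a disk name to the cached data one stripe holds for it.
stripe_caches = [{"sdc": b"a"}, {"sdb": b"x"}, {"sdc": b"b"}]
written = []

def write_ok(disk, data):
    written.append((disk, data))
    return True

def write_fails(disk, data):
    return False

outcome_ok = flush_cache_to_disk("sdc", stripe_caches, write_ok, lambda d: False)
outcome_dropped = flush_cache_to_disk("sdc", stripe_caches, write_fails, lambda d: False)
outcome_broken = flush_cache_to_disk("sdc", stripe_caches, write_fails, lambda d: True)
```

The three outcomes map onto the three branches in the text: overrun recovery succeeds, the disk stays an overrun disk, or the overrun condition is marked unrecoverable.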
Step 105: if all overrun disks have completed overrun recovery, reactivate upper-layer read/write operations and end the overrun waiting process.
At this step, all overrun disks have recovered from their failures, and the cache data of all stripes that needed to be written to them has been written successfully, so all overrun disks are working normally again. The number of failed disks is now within the redundancy capability of the distributed storage system, so upper-layer read/write operations can be reactivated, the overrun waiting process ends, and the distributed storage system resumes normal operation.
Step 106: determine that the overrun condition cannot be resolved, and mark the overrun condition of the distributed storage system as unrecoverable.
In the embodiment of the invention, as long as any overrun disk is permanently damaged, the overrun condition of the distributed storage system cannot be resolved or recovered from. When the overrun condition is unrecoverable, an error or alarm message may also be returned to the upper-layer application (that is, the user program).
In the embodiment of the present invention shown in FIG. 1, when an overrun condition occurs in the distributed storage system, that is, when the number of failed disks exceeds the redundancy capability threshold of the distributed storage system, the failed disks are divided into ordinary offline disks and overrun disks. Provided that no overrun disk is permanently damaged, upper-layer read/write operations are suspended and an overrun waiting process is started, during which the overrun disks are brought through failure recovery so that the stripe data remains consistent. The distributed storage system can therefore resume normal operation once the overrun condition is resolved, which effectively enhances its reliability.
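The overrun waiting process of steps 103 to 106 can be summarized as a small state holder. This is a sketch under assumptions, not the patented implementation: the class `OverrunWait`, its attributes, and the `flush` callback (assumed to return "recovered", "still_overrun", or "unrecoverable") are hypothetical names.

```python
class OverrunWait:
    """Tracks the overrun waiting process entered in step 103."""

    def __init__(self, overrun_disks):
        self.pending = set(overrun_disks)  # overrun disks not yet recovered
        self.upper_io_suspended = True     # upper-layer read/write is paused
        self.unrecoverable = False

    def on_online(self, disk, flush):
        """Handle an online notification for one overrun disk (step 104)."""
        if self.unrecoverable or disk not in self.pending:
            return
        outcome = flush(disk)
        if outcome == "unrecoverable":
            self.unrecoverable = True          # step 106
        elif outcome == "recovered":
            self.pending.discard(disk)
            if not self.pending:
                self.upper_io_suspended = False  # step 105
        # "still_overrun": the disk stays pending and waiting continues

wait = OverrunWait({"sdc", "sdd"})
wait.on_online("sdc", lambda d: "recovered")
suspended_after_first = wait.upper_io_suspended
wait.on_online("sdd", lambda d: "still_overrun")
pending_after_retry = set(wait.pending)
wait.on_online("sdd", lambda d: "recovered")
```

Upper-layer read/write is reactivated only after the last pending overrun disk reports a successful write-back, which is the condition step 105 requires.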
The overrun protection method in the distributed storage system of the present invention has been described in detail above. The present invention also provides an overrun protection device in a distributed storage system, which is described in detail below with reference to FIG. 2.
Referring to FIG. 2, which is a schematic structural diagram of an overrun protection device in a distributed storage system according to an embodiment of the present invention, the device includes:
a dividing unit 201, configured to divide the failed disks into ordinary offline disks and overrun disks when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold of the distributed storage system;
a judging unit 202, configured to judge whether an overrun disk is permanently damaged;
a suspending unit 203, configured to suspend upper-layer read/write, save the cache data of all current stripes, and enter an overrun waiting process if the judging unit 202 judges that none of the overrun disks is permanently damaged;
a processing unit 204, configured to, during the overrun waiting process and for each overrun disk that recovers from its failure, write to that overrun disk the cache data, among the cache data of all stripes, that needs to be written to it, and perform overrun recovery on that overrun disk;
and a recovery unit 205, configured to activate upper-layer read/write and end the overrun waiting process if all overrun disks have completed overrun recovery.
The device shown in FIG. 2 further includes a detecting unit 206;
the detecting unit 206 is configured to record the time of a disk failure when a failed disk is detected;
the dividing unit 201, when dividing the failed disks into ordinary offline disks and overrun disks, is configured to: sort the failed disks by failure time, determine the first m failed disks to be ordinary offline disks, and determine the other failed disks to be overrun disks, where m is the redundancy capability threshold of the distributed storage system.
In the device shown in FIG. 2, the detecting unit 206, when detecting whether a disk has failed, is configured to: determine that the disk has failed when reading data from the disk fails or writing data to the disk fails.
In the device shown in FIG. 2, the suspending unit 203 is configured to determine that the overrun condition cannot be resolved and mark the overrun condition of the distributed storage system as unrecoverable if the judging unit 202 judges that any of the overrun disks is permanently damaged.
The device shown in FIG. 2 further includes a receiving unit 207;
the receiving unit 207 is configured to receive the online notification of each failed disk;
the processing unit 204 is further configured to: if the receiving unit 207 receives an online notification for an overrun disk, determine that the overrun disk has recovered from its failure; otherwise, determine that the overrun disk has not recovered.
In the device shown in FIG. 2, the processing unit 204 is further configured to: when the cache data of any stripe that needs to be written to an overrun disk fails to be written to it, instruct the judging unit 202 to judge whether the overrun disk is permanently damaged; if so, determine that the overrun condition cannot be resolved and mark the overrun condition of the distributed storage system as unrecoverable; if not, keep the disk as an overrun disk.
In the device shown in FIG. 2, the judging unit 202, when judging whether a failed disk is permanently damaged, is configured to: determine that the failed disk is permanently damaged if it failed because of physical damage, and otherwise determine that it is not permanently damaged.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (14)

1. An overrun protection method in a distributed storage system, the method comprising:
when the number of failed disks in the distributed storage system exceeds the redundancy capability threshold m of the distributed storage system, selecting m of the failed disks as ordinary offline disks and treating the remaining failed disks as overrun disks;
if none of the overrun disks is permanently damaged, suspending upper-layer read/write, saving the cache data of all current stripes, and entering an overrun waiting process, the overrun waiting process being a process of waiting for all overrun disks to complete overrun recovery;
during the overrun waiting process, for each overrun disk that recovers from its failure, writing to that overrun disk the cache data, among the cache data of all stripes, that needs to be written to it, and performing overrun recovery on that overrun disk;
and, if all overrun disks have completed overrun recovery, activating upper-layer read/write and ending the overrun waiting process.
2. The method of claim 1, wherein
the method further comprises: when a failed disk is detected, recording the time of the disk failure;
selecting m of the failed disks as ordinary offline disks and treating the remaining failed disks as overrun disks comprises: sorting the failed disks by failure time, determining the first m failed disks to be ordinary offline disks, and determining the remaining failed disks to be overrun disks.
3. The method of claim 2, wherein
detecting whether a disk has failed comprises: determining that the disk has failed when reading data from the disk fails or writing data to the disk fails.
4. The method of claim 1, wherein
the method further comprises: if any of the overrun disks is permanently damaged, determining that the overrun condition cannot be resolved, and marking the overrun condition of the distributed storage system as unrecoverable.
5. The method of claim 1, wherein
determining whether an overrun disk has recovered from its failure comprises: if an online notification for the overrun disk is received, determining that the overrun disk has recovered from its failure; otherwise, determining that the overrun disk has not recovered from its failure.
6. The method of claim 1, wherein
the method further comprises: when the cache data of any stripe that needs to be written to an overrun disk fails to be written to it, judging whether the overrun disk is permanently damaged; if so, determining that the overrun condition cannot be resolved and marking the overrun condition of the distributed storage system as unrecoverable; if not, keeping the disk as an overrun disk.
7. The method of claim 1, 5, or 6, wherein
judging whether an overrun disk is permanently damaged comprises: if the overrun disk failed because of physical damage, determining that the overrun disk is permanently damaged; otherwise, determining that the overrun disk is not permanently damaged.
8. An overrun protection arrangement in a distributed storage system, the arrangement comprising:
the dividing unit is used for selecting m fault disks from the fault disks as common offline disks and taking the rest fault disks as overrun disks when the number of the fault disks in the distributed storage system exceeds a redundancy capability threshold value m of the distributed storage system;
the judging unit is used for judging whether the overrun disk is a permanently damaged disk or not;
the suspension unit is used for suspending upper layer read/write, storing the cache data of all current strips and entering an overrun waiting process if the judgment unit judges that the permanently damaged overrun disks do not exist in all the overrun disks; the overrun waiting process is a process of waiting for overrun recovery of all the overrun disks;
the processing unit is used for writing the cache data which needs to be written into the overrun disk in the cache data of all the strips into the overrun disk for each overrun disk which is recovered in a fault in the overrun waiting process, and the overrun disk is recovered in an overrun mode;
and the recovery unit is used for activating the upper layer read/write and ending the overrun waiting process if all the overrun disks are overrun and recovered.
9. The apparatus of claim 8, further comprising a detection unit;
the detection unit is used for recording the time of the failure of the disk when the failed disk is detected;
the dividing unit is used for selecting m fault disks from the fault disks as common offline disks, and when the rest fault disks are used as overrun disks, the dividing unit is used for: sequencing according to the time sequence of the disk failures, determining the first m failed disks as ordinary offline disks, and determining the rest failed disks as overrun disks; wherein m is a redundancy capability threshold of the distributed storage system.
10. The apparatus of claim 9,
the detection unit, when detecting whether the disk is faulty, is configured to: when reading data from the disk fails or writing data to the disk fails, the disk failure is determined.
11. The apparatus of claim 8,
the suspension unit is further configured to, if the judging unit determines that any of the overrun disks is permanently damaged, determine that the overrun condition cannot be resolved and mark the overrun condition of the distributed storage system as unrecoverable.
12. The apparatus of claim 8, further comprising a receiving unit;
the receiving unit is used for receiving online notifications of each fault disk;
the processing unit is further configured to: if the receiving unit receives an online notification for any overrun disk, determine that the overrun disk has recovered from its fault; otherwise, determine that the overrun disk has not recovered from its fault.
13. The apparatus of claim 8,
the processing unit is further configured to: when writing to the overrun disk the cached data of any stripe that needs to be written to it fails, instruct the judging unit to judge whether the overrun disk is a permanently damaged disk; if so, determine that the overrun condition cannot be resolved and mark the overrun condition of the distributed storage system as unrecoverable; if not, continue to keep the disk as an overrun disk.
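The write-failure handling in claim 13 has three possible outcomes for a disk being flushed. The sketch below is one way to express it in Python, under assumptions that are not in the claims: a failed write is signalled by `OSError`, the callables `write_block` and `is_permanently_damaged` are hypothetical stand-ins for the disk write path and the judging unit, and the returned strings are illustrative labels.

```python
from typing import Callable, Iterable


def flush_to_overrun_disk(
    disk: str,
    blocks: Iterable[bytes],
    write_block: Callable[[str, bytes], None],
    is_permanently_damaged: Callable[[str], bool],
) -> str:
    """Write cached stripe data to a fault-recovered overrun disk.

    Returns "overrun_recovered" if every block was written,
    "unrecoverable" if a write failed on a permanently damaged disk,
    and "still_overrun" if a write failed but the disk may yet recover.
    """
    for block in blocks:
        try:
            write_block(disk, block)
        except OSError:  # assumption: a failed write raises OSError
            # Ask the judging unit whether waiting any longer makes sense.
            if is_permanently_damaged(disk):
                return "unrecoverable"
            return "still_overrun"
    return "overrun_recovered"
```

A caller would keep the disk in the overrun set on "still_overrun" and retry after the next online notification, and would mark the whole overrun condition as unrecoverable on "unrecoverable".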
14. The apparatus of claim 8, 12, or 13,
when judging whether the overrun disk is a permanently damaged disk, the judging unit is configured to: if the failure of the overrun disk is caused by physical damage, determine that the overrun disk is a permanently damaged disk; otherwise, determine that the overrun disk is not a permanently damaged disk.
CN201711389565.3A 2017-12-21 2017-12-21 Overrun protection method and device in distributed storage system Active CN108170375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711389565.3A CN108170375B (en) 2017-12-21 2017-12-21 Overrun protection method and device in distributed storage system

Publications (2)

Publication Number Publication Date
CN108170375A CN108170375A (en) 2018-06-15
CN108170375B true CN108170375B (en) 2020-12-18

Family

ID=62523194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711389565.3A Active CN108170375B (en) 2017-12-21 2017-12-21 Overrun protection method and device in distributed storage system

Country Status (1)

Country Link
CN (1) CN108170375B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968456B (en) * 2018-09-30 2023-05-02 阿里巴巴集团控股有限公司 Method and device for processing fault disk in distributed storage system
CN109445712A (en) * 2018-11-09 2019-03-08 浪潮电子信息产业股份有限公司 A kind of command processing method, system, equipment and computer readable storage medium
CN113672437A (en) * 2021-07-31 2021-11-19 济南浪潮数据技术有限公司 Disk fault processing method and device for distributed storage system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101630237A (en) * 2009-08-25 2010-01-20 成都市华为赛门铁克科技有限公司 Method, device and system for controlling RAID
JP2011065469A (en) * 2009-09-17 2011-03-31 Kddi Corp Distributed file system and node start-up method in distributed file system
CN102981870A (en) * 2012-11-05 2013-03-20 曙光信息产业(北京)有限公司 Magnetic disk off-line processing method in Linux system
CN105095008A (en) * 2015-08-25 2015-11-25 国电南瑞科技股份有限公司 Distributed task fault redundancy method suitable for cluster system
CN105335251A (en) * 2015-09-23 2016-02-17 浪潮(北京)电子信息产业有限公司 Fault recovery method and system
CN107273231A (en) * 2016-04-07 2017-10-20 阿里巴巴集团控股有限公司 Distributed memory system hard disk tangles fault detect, processing method and processing device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9456035B2 (en) * 2013-05-03 2016-09-27 International Business Machines Corporation Storing related data in a dispersed storage network

Also Published As

Publication number Publication date
CN108170375A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
US8943358B2 (en) Storage system, apparatus, and method for failure recovery during unsuccessful rebuild process
US9009526B2 (en) Rebuilding drive data
CN100530125C (en) Safety storage method for data
US7587631B2 (en) RAID controller, RAID system and control method for RAID controller
US7490103B2 (en) Method and system for backing up data
US10120769B2 (en) Raid rebuild algorithm with low I/O impact
US7761660B1 (en) Identifying suspect disks
US20060236161A1 (en) Apparatus and method for controlling disk array with redundancy
US9529674B2 (en) Storage device management of unrecoverable logical block addresses for RAID data regeneration
KR100711165B1 (en) Apparatus, method and recording medium for the control of storage
CN108170375B (en) Overrun protection method and device in distributed storage system
US20110202791A1 (en) Storage control device , a storage system, a storage control method and a program thereof
CN109710456B (en) Data recovery method and device
KR20060043873A (en) System and method for drive recovery following a drive failure
CN110750213A (en) Hard disk management method and device
US10606490B2 (en) Storage control device and storage control method for detecting storage device in potential fault state
CN105138280A (en) Data write-in method, apparatus and system
US20070234107A1 (en) Dynamic storage data protection
US7529776B2 (en) Multiple copy track stage recovery in a data storage system
CN111240903A (en) Data recovery method and related equipment
US7174476B2 (en) Methods and structure for improved fault tolerance during initialization of a RAID logical unit
JP2006079219A (en) Disk array controller and disk array control method
JP4143040B2 (en) Disk array control device, processing method and program for data loss detection applied to the same
US20100169572A1 (en) Data storage method, apparatus and system for interrupted write recovery
JP2016057876A (en) Information processing apparatus, input/output control program, and input/output control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 505, Taixing Building, 11 Garden East Road, Haidian District, Beijing, 100191

Applicant after: Innovation Technology Co., Ltd.

Address before: Room 0801-0805, 51 College Road, Haidian District, Beijing, 100191

Applicant before: Innovation and Technology Storage Technology Co., Ltd.

GR01 Patent grant