CN110968456B

CN110968456B - Method and device for processing fault disk in distributed storage system

Info

Publication number: CN110968456B
Application number: CN201811156593.5A
Authority: CN
Inventors: 王勇; 王鹏; 闫宁; 林江彬
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-09-30
Filing date: 2018-09-30
Publication date: 2023-05-02
Anticipated expiration: 2038-09-30
Also published as: CN110968456A

Abstract

The invention discloses a method and a device for processing a failed disk in a distributed storage system. The method comprises the following steps: generating an offline task for a target disk in a distributed storage system when a fault of the target disk is detected; determining the processing priority of the offline task; determining a redundancy level of the target disk when the offline task is processed based on the processing priority; and, if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program. The invention solves the technical problem in the prior art that data security is difficult to ensure when disk faults in a distributed storage system are processed.

Description

Method and device for processing fault disk in distributed storage system

Technical Field

The invention relates to the field of fault processing of a distributed storage system, in particular to a method and a device for processing a fault disk in the distributed storage system.

Background

When a distributed storage system is deployed at large scale, even a small failure rate (e.g., 1%-2% per year for disks, 7% per year for machines) results in a significant number of disk and machine failures every day. These faulty devices affect the reliability of the system on the one hand and, on the other hand, reduce the available resources of the system, which also wastes cost.

To solve this problem, the related art provides a state-machine-based method for handling failures automatically. For disk failures, however, that method simply treats the failure as a machine failure and responds by reformatting the machine, reinstalling the operating system, or replacing the entire machine. Such handling is too coarse-grained for a machine with tens of large-capacity disks: it can trigger large-scale data replication, has a heavy impact on the system, and does not solve the two key problems of disk handling, namely backing up data and controlling data security.

In addition, the prior-art FBAR system is a standalone workflow system. For example, when a machine fault occurs, the FBAR system marks the machine as "to be maintained", and a maintenance procedure then performs the subsequent operations. The FBAR system does not take into account other operation and maintenance tasks that may be running concurrently, which easily causes problems of data availability and concurrency efficiency. Nor does the FBAR system solve the problem of how to keep the data in the storage system secure.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the invention provides a method and a device for processing a failed disk in a distributed storage system, which at least solve the technical problem in the prior art that data security is difficult to ensure when disk faults in a distributed storage system are processed.

According to an aspect of an embodiment of the present invention, there is provided a method for processing a failed disk in a distributed storage system, including: generating an offline task for a target disk in a distributed storage system when a fault of the target disk is detected; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and, if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

According to another aspect of the embodiment of the present invention, there is also provided a device for processing a failed disk in a distributed storage system, including: a generation module, configured to generate an offline task for a target disk in the distributed storage system when a fault of the target disk is detected; a first determining module, configured to determine the processing priority of the offline task; a second determining module, configured to determine the redundancy level of the target disk when the offline task's turn to be processed arrives; and a processing module, configured to perform offline processing on the target disk through a disk processing program if the redundancy level of the target disk meets the offline condition.

According to another aspect of the embodiment of the present invention, there is also provided a storage medium including a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the following steps: generating an offline task for a target disk in a distributed storage system when a fault of the target disk is detected; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and, if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

According to another aspect of an embodiment of the present invention, there is also provided a computer system including: a processor; and a memory, coupled to the processor, for providing the processor with instructions for the following processing steps: generating an offline task for a target disk in a distributed storage system when a fault of the target disk is detected; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and, if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

In the embodiment of the invention, an offline task for the target disk is generated when a fault of the target disk in the distributed storage system is detected; the processing priority of the offline task is determined; the redundancy level of the target disk is determined when the offline task is processed; and, if the redundancy level of the target disk meets the offline condition, offline processing is performed on the target disk through a disk processing program.

Based on the above method for processing a failed disk in a distributed storage system, the failed disk is taken offline in a disk-oriented manner. Because processing is driven by disk state, errors during the offline processing of the failed disk can be avoided, which ensures the robustness of the distributed storage system, makes the processing reentrant, and ensures the correctness of the processing logic.

Therefore, the method and the device achieve the aim of ensuring data security when the disk faults are processed in the distributed storage system, achieve the technical effect of improving the efficiency of the offline processing of the faulty disk, and further solve the technical problem that the data security is difficult to ensure when the disk faults are processed in the distributed storage system in the prior art.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:

FIG. 1 is a block diagram of a hardware architecture of a computer terminal (or mobile device) for implementing a method of handling failed disks in a distributed storage system according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of handling a failed disk in a distributed storage system according to an embodiment of the present invention;

FIG. 3 is a flow chart of an alternative method of handling failed disks in a distributed storage system according to an embodiment of the present invention;

FIG. 4 is a flow chart of an alternative method of handling failed disks in a distributed storage system according to an embodiment of the present invention;

FIG. 5 is a flow chart of an alternative method of handling failed disks in a distributed storage system according to an embodiment of the present invention;

FIG. 6 is a flow chart of an alternative method of handling failed disks in a distributed storage system according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an apparatus for handling failed disks in a distributed storage system according to an embodiment of the invention; and

Fig. 8 is a block diagram of a computer terminal according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, some of the terms appearing in the description of the embodiments of the present application are explained as follows:

Configuration management database (CMDB): a database used to store and manage the configuration information of equipment in an enterprise IT architecture. It is closely connected with all service support and service delivery processes, supports the operation of those processes, and realizes the value of the configuration information, while relying on those processes to keep its data accurate.

Storage node: refers to a machine for storing application data and may generally include several storage media.

Distributed storage system: refers to a storage system comprising several storage nodes, typically employing distributed algorithms to provide high availability, fault tolerance and high performance.

Failure: refers to an abnormal condition that may cause the distributed storage system or its functionality to fail.

Events: a significant change in disk state that can be recognized by the distributed system.

Tasks: may be used to describe the transactional work that needs to be handled.

Example 1

In accordance with an embodiment of the present invention, there is also provided an embodiment of a method for handling failed disks in a distributed storage system. It should be noted that the steps shown in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, the steps shown or described may in some cases be performed in a different order than shown or described herein.

The method embodiment provided in embodiment 1 of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a block diagram of a hardware architecture of a computer terminal (or mobile device) for implementing a method of handling failed disks in a distributed storage system. As shown in fig. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n), which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal 10 may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination. Furthermore, the data processing circuitry may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the present application, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination to interface).

The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage devices corresponding to the method for processing a failed disk in a distributed storage system according to an embodiment of the present invention. The processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, it implements the above method for processing a failed disk in a distributed storage system. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission module 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station so as to communicate with the internet. In one example, the transmission module 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).

In the above operating environment, the present application provides a method for processing a failed disk in a distributed storage system. Fig. 2 is a flowchart of the method according to an embodiment of the present invention. As shown in fig. 2, the method includes the following method steps:

Step S202, generating an offline task for a target disk in the distributed storage system when a fault of the target disk is detected.

In an alternative embodiment, the fault of the target disk may include, but is not limited to, any of the following: intermittent failure, in which a read or write does not succeed; medium damage, in which, for example, one or more bits are permanently corrupted and cannot be read correctly; write failure, in which a sector being written can neither be written successfully nor have its previously written contents retrieved, for example because power is lost while the sector is being written; and disk crash, in which the entire disk becomes permanently unreadable.

It should be noted that, in the embodiment of the present application, the target state of each component may be recorded in a configuration management database in the distributed storage system, and a disk detection program running in the distributed storage system detects whether the target disk has a fault. For example, the disk detection program may, but is not limited to, continuously scan the state of each disk and, on detecting that the state of the target disk is abnormal, report an event indicating that the target disk has a fault to the configuration management database. The configuration management database records the event, and an event dispatcher running in the distributed storage system generates an offline task for the target disk according to the event.

Step S204, determining the processing priority of the offline task.

In the embodiment of the present application, the processing priority of the offline task may be determined by, but is not limited to being determined by, an approval program running in the distributed storage system.

As an optional embodiment, the approval program is responsible for approving operation and maintenance tasks. It can continuously obtain the tasks currently to be processed from the configuration management database and determine the processing priority of each task.

When a task obtained by the approval program is the offline task of the target disk, the approval program can weigh the processing priorities of all tasks currently to be processed, and determine from those priorities the order in which the tasks are processed and the time at which each is processed.

Step S206, determining the redundancy level of the target disk when the offline task is processed.

It should be noted that, when the offline task is processed, the approval program may determine whether the redundancy level of the data information of the target disk meets the offline condition by checking whether that data information has been backed up. If the redundancy level meets the offline condition, the data information of the target disk has been backed up, and reading and writing of the data information will not be affected after the target disk goes offline.

Step S208, if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through the disk processing program.

In the embodiment of the application, when the redundancy level of the target disk meets the offline condition, the approval program may modify the local attribute state of the target disk to the offline state, so that the target disk is not misused even if a program is restarted; at the same time, the approval program modifies the state of the offline task of the target disk in the configuration management database to the offline state.

Furthermore, once it is confirmed that the offline task of the target disk can be processed, the disk processing program running in the distributed storage system may remove the target disk whose attribute state is the offline state from the distributed storage system and change the state of the offline task of the target disk to the completed state. At the same time, it generates and issues a work order notifying maintenance staff that the target disk can be pulled at any time for repair.

It should be noted that each program running in the distributed storage system in the embodiment of the present application may, but is not limited to, act independently, based only on whether the target disk is in a particular state; when any program encounters an exception and is restarted, it can resume processing at any time according to the current state of the target disk.
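The state-driven, reentrant behaviour described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the state names in `DiskState`, the dictionary layout of a disk record, and the two step functions are hypothetical and are not taken from the embodiment.

```python
from enum import Enum


class DiskState(Enum):
    # Hypothetical state names, chosen for illustration only.
    FAULTY = "faulty"
    OFFLINE = "offline"
    REMOVED = "removed"


def approval_step(disk: dict) -> None:
    """Approval program: acts only on faulty disks whose data is redundant."""
    if disk["state"] is DiskState.FAULTY and disk["redundant"]:
        disk["state"] = DiskState.OFFLINE


def disk_processing_step(disk: dict) -> None:
    """Disk processing program: acts only on disks already marked offline."""
    if disk["state"] is DiskState.OFFLINE:
        disk["state"] = DiskState.REMOVED
```

Because each step inspects only the recorded state, running a step again after a restart either makes progress or does nothing, which is one simple way to obtain the reentrancy the text refers to.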

In an alternative embodiment, before detecting that there is a failure of a target disk in the distributed storage system, the method further includes the following method step:

scanning the working state of each disk, and determining whether each disk has a fault according to a failure determination rule.

As an alternative embodiment, the distributed storage system in the embodiment of the present application may define in advance a set of failure determination rules for detecting disk failures, for example, checking the state of reading and writing data on a disk to determine whether the data can be read and written successfully and whether there is a read failure or a write failure. Whether each disk has a fault is then determined from the working state of each disk, as scanned by a disk detection program running in the distributed storage system, together with the failure determination rules.
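As a rough sketch of how predefined rules could be applied to scanned working states, the following example treats each rule as a predicate over a disk's status. The rule names, the status fields `read_ok` and `write_ok`, and the function names are assumptions made for illustration; the embodiment does not specify them.

```python
from typing import Callable, Dict, List

# A scanned working state, e.g. {"read_ok": True, "write_ok": False}.
# The field names are hypothetical.
DiskStatus = Dict[str, bool]
Rule = Callable[[DiskStatus], bool]

# Each rule returns True when the status indicates a failure.
FAILURE_RULES: Dict[str, Rule] = {
    "read_failure": lambda status: not status.get("read_ok", True),
    "write_failure": lambda status: not status.get("write_ok", True),
}


def violated_rules(status: DiskStatus) -> List[str]:
    """Return the names of all rules that the status violates."""
    return [name for name, rule in FAILURE_RULES.items() if rule(status)]


def scan_disks(statuses: Dict[str, DiskStatus]) -> Dict[str, List[str]]:
    """Map each faulty disk id to the rules it violates."""
    faulty = {}
    for disk_id, status in statuses.items():
        hits = violated_rules(status)
        if hits:
            faulty[disk_id] = hits
    return faulty
```

Keeping the rules in a table makes it possible to add further checks without changing the scanning loop.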

In an alternative embodiment, step S202, generating an offline task for the target disk when a failure of the target disk in the distributed storage system is detected, may be implemented by the following steps:

Step S2020, reporting the event of the fault of the target disk to a configuration management database;

Step S2022, generating the offline task of the target disk according to the event, and issuing the offline task.

In an alternative embodiment, whether the target disk has a fault may be detected by, but is not limited to, a disk detection program running in the distributed storage system. When a fault of the target disk in the distributed storage system is detected, an event indicating the fault is reported to the configuration management database in the distributed storage system, which records the event; an event dispatcher running in the distributed storage system then generates an offline task for the target disk according to the event and issues the offline task.
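The reporting path of steps S2020 and S2022 can be sketched as follows. This is a minimal in-memory illustration, not the actual implementation: `ConfigDB`, `DiskEvent`, `OfflineTask`, the `"pending"` state name, and the choice to create at most one task per disk are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiskEvent:
    disk_id: str
    reason: str  # e.g. "read_failure"; the label is hypothetical


@dataclass
class OfflineTask:
    disk_id: str
    state: str = "pending"  # hypothetical initial state name


@dataclass
class ConfigDB:
    """In-memory stand-in for the configuration management database."""
    events: List[DiskEvent] = field(default_factory=list)
    tasks: List[OfflineTask] = field(default_factory=list)

    def report_event(self, event: DiskEvent) -> None:
        # Step S2020: the detection program reports the fault event.
        self.events.append(event)


def dispatch_events(db: ConfigDB) -> List[OfflineTask]:
    """Step S2022: generate and issue one offline task per faulty disk."""
    already_tasked = {task.disk_id for task in db.tasks}
    issued = []
    for event in db.events:
        # Assumption: repeated events for one disk do not create new tasks.
        if event.disk_id not in already_tasked:
            task = OfflineTask(disk_id=event.disk_id)
            db.tasks.append(task)
            already_tasked.add(event.disk_id)
            issued.append(task)
    return issued
```

Recording the event before any task is created means the dispatcher can be rerun safely: a second call finds the existing task and issues nothing new.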

In an alternative embodiment, fig. 3 is a flowchart of an alternative method for processing a failed disk in a distributed storage system according to an embodiment of the present invention. As shown in fig. 3, determining the processing priority of the offline task in step S204 may be implemented by the following steps:

Step S302, determining a task to be processed in a task operation and maintenance operation room under the distributed storage system;

Step S304, determining the data influence range of the target disk after it goes offline;

Step S306, determining the backup state of the target disk;

Step S308, determining the processing priority of the offline task based on the task to be processed, the data influence range of the target disk, and the backup state of the target disk.

In this embodiment of the present application, an approval program running in the distributed storage system may, but is not limited to, obtain the current tasks to be processed from the task operation and maintenance operation room under the distributed storage system, determine the data influence range after the target disk goes offline and the backup state of the target disk, and determine the processing priority of the offline task by jointly considering the tasks to be processed, the data influence range of the target disk, and the backup state of the target disk.
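The embodiment names the three inputs to the priority decision but does not give a formula. The sketch below shows one possible way to combine them and to order pending tasks with a heap. The scoring weights, the convention that a lower number is processed earlier, and all names (`PendingTask`, `offline_task_priority`, `processing_order`) are assumptions for illustration only.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class PendingTask:
    priority: int                     # lower value = processed earlier
    name: str = field(compare=False)  # not used for ordering


def offline_task_priority(pending_count: int, affected_fraction: float,
                          backed_up: bool) -> int:
    """Hypothetical scoring over the three inputs of steps S302 to S306."""
    score = 0
    if not backed_up:
        score += 10                      # defer until data is backed up
    score += int(affected_fraction * 10)  # larger influence range, later
    score += min(pending_count, 5)        # yield to other pending work
    return score


def processing_order(tasks: List[PendingTask]) -> List[str]:
    """Return task names in the order a priority queue would serve them."""
    heap = list(tasks)
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]
```

A backed-up disk with no affected data therefore scores low and is taken offline early, while a disk whose data is not yet redundant is pushed back behind other work.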

In an alternative embodiment, fig. 4 is a flowchart of an alternative method for processing a failed disk in a distributed storage system according to an embodiment of the present invention. As shown in fig. 4, performing offline processing on the target disk through a disk processing program if the redundancy level of the target disk meets the offline condition includes:

Step S402, determining that the redundancy level of the target disk meets the offline condition when data reading and writing will not be affected after the target disk goes offline;

Step S404, modifying the attribute state of the target disk to the offline state, and performing offline processing, through the disk processing program, on the target disk whose attribute state is the offline state.

In an alternative embodiment, if data reading and writing will not be affected after the target disk goes offline, for example because the data information of the target disk has been backed up, it is determined that the redundancy level of the data information contained in the target disk meets the offline condition; if data reading and writing will be affected after the target disk goes offline, it is determined that the redundancy level of the data information contained in the target disk does not meet the offline condition.
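One way to read the offline condition is that every piece of data stored on the target disk must remain readable from some other healthy disk. The sketch below checks exactly that. The replica map layout, the notion of a "chunk", and the requirement of at least one healthy copy elsewhere are assumptions made for illustration; the embodiment does not fix a threshold.

```python
from typing import Dict, Set


def meets_offline_condition(target: str,
                            replicas: Dict[str, Set[str]],
                            healthy: Set[str]) -> bool:
    """Return True if taking `target` offline leaves every chunk readable.

    `replicas` maps a chunk id to the set of disk ids holding a copy.
    `healthy` is the set of disk ids currently able to serve reads.
    """
    for disks in replicas.values():
        if target not in disks:
            continue  # this chunk is not stored on the target disk
        others = disks - {target}
        if not any(disk in healthy for disk in others):
            return False  # the only usable copy lives on the target disk
    return True
```

For example, with `replicas = {"c1": {"d1", "d2"}, "c2": {"d1"}}` and `healthy = {"d2"}`, disk `d1` does not meet the condition because chunk `c2` has no copy elsewhere.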

In an alternative embodiment, ensuring that data reading and writing are not affected after the target disk goes offline may be implemented by, but is not limited to, the following steps:

Step S2062, if the failure monitoring module in the distributed storage system detects that the target disk has failed, backing up the data information of the target disk using disks other than the target disk; or

Step S2064, if the probability that the target disk cannot operate normally is greater than a preset probability, backing up the data information of the target disk to disks other than the target disk in the distributed storage system.

In an alternative embodiment, it may be determined in at least one of the following manners, without limitation, that data reading and writing are not affected after the target disk goes offline:

One is that, when a fault monitoring module in the distributed storage system detects that the target disk has a fault, the data information of the target disk is backed up using disks other than the target disk. At that point the redundancy level of the data information of the target disk can be determined to meet the offline condition, data reading and writing will not be affected after the target disk goes offline, and the target disk can be taken offline directly.

The other is that the target disk is still working normally but carries a potential risk of no longer working normally. For example, if the probability that the target disk cannot work normally is greater than a preset probability, the approval program may proactively back up the data information in the target disk to other storage nodes, for example to disks other than the target disk in the distributed storage system, so as to ensure that the redundancy level of the data information of the target disk meets the offline condition.
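The second manner, proactive backup when the estimated failure probability exceeds a preset value, might look like the sketch below. How the probability is estimated is outside the scope of the text and of this example; the function name, the single `spare_disk` destination, and the replica map layout are hypothetical.

```python
from typing import Dict, List, Set


def proactive_backup(target: str,
                     failure_probability: float,
                     preset_probability: float,
                     replicas: Dict[str, Set[str]],
                     spare_disk: str) -> List[str]:
    """Copy the target disk's chunks to `spare_disk` if the risk is high.

    Returns the sorted ids of the chunks that were copied. `replicas` maps
    a chunk id to the set of disk ids holding a copy and is updated in place.
    """
    if failure_probability <= preset_probability:
        return []  # risk is acceptable, nothing to do
    copied = []
    for chunk, disks in replicas.items():
        if target in disks and spare_disk not in disks:
            disks.add(spare_disk)  # stands in for the actual data copy
            copied.append(chunk)
    return sorted(copied)
```

After such a copy, every chunk on the target disk has a second location, so a later redundancy check can find the offline condition satisfied.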

Based on the alternative scheme provided by the embodiment of the application, the approval program checks whether data reading and writing would be affected once the target disk is offline. When the redundancy level of the data information of the target disk meets the offline condition, data reading and writing are not affected, so the target disk can be taken offline without affecting reads and writes of its data information.

In an alternative embodiment, fig. 5 is a flowchart of an alternative method for processing a failed disk in a distributed storage system according to an embodiment of the present invention. As shown in fig. 5, after the offline task for the target disk is generated upon detecting a failure of the target disk in the distributed storage system, the method further includes:

Step S502, determining a backup strategy according to the data influence range of the target disk after it goes offline;

Step S504, backing up the data information of the target disk to disks other than the target disk in the distributed storage system by adopting the backup strategy.

In the embodiment of the application, the data influence range is calculated based on the redundancy of the data information.

Optionally, the data influence range is the range of data information whose reading and writing are affected once the failed disk goes offline, and it can be calculated from the redundancy of the data information contained in the target disk.

It should be noted that, in the embodiment of the present application, the higher the redundancy level of the data information, the smaller the corresponding data influence range; the lower the redundancy level of the data information, the larger the corresponding data influence range.

In an alternative embodiment, if the influence range is large, which indicates that the redundancy level of the data information is low and that the data information of the target disk needs to be backed up, a backup policy of backing up the data information of the target disk may be determined. In another alternative embodiment, if the influence range is small, which indicates that the redundancy level of the data information is high and that the data information of the target disk does not need to be backed up, a backup policy of not backing up the data information of the target disk may be determined.
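The relationship between redundancy, influence range, and backup policy can be made concrete with a small sketch. The embodiment does not define how the influence range is computed; here it is assumed, purely for illustration, to be the fraction of the target disk's chunks that have no copy on any other disk. The policy labels and the default threshold are likewise hypothetical.

```python
from typing import Dict, Set


def influence_range(target: str, replicas: Dict[str, Set[str]]) -> float:
    """Assumed metric: share of the target's chunks with no other copy."""
    on_target = [disks for disks in replicas.values() if target in disks]
    if not on_target:
        return 0.0
    exposed = sum(1 for disks in on_target if len(disks) == 1)
    return exposed / len(on_target)


def choose_backup_policy(affected: float, threshold: float = 0.0) -> str:
    """Large range: back up first. Small range: no backup needed."""
    if affected > threshold:
        return "backup_before_offline"
    return "no_backup_needed"
```

With `replicas = {"c1": {"d1", "d2"}, "c2": {"d1"}}`, half of the chunks on `d1` exist nowhere else, so `influence_range("d1", replicas)` is 0.5 and the policy is to back up before going offline.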

In the above alternative embodiment, since the target disk fails and needs to be processed offline, the data information included in the target disk may be backed up to other disks, where the other disks are disks in the distributed storage system except for the target disk.

In an alternative embodiment, after the data information of the target disk has been backed up, the attribute state of the target disk is modified to the offline state, and offline processing is performed, through the disk processing program, on the target disk whose attribute state is the offline state.

In the above optional embodiment, once the data information of the target disk has been backed up, the approval program may modify the attribute state of the target disk to the offline state, so that the target disk is not misused even if a program is restarted; at the same time, the approval program modifies the state of the offline task of the target disk in the configuration management database to the offline state.

In an alternative embodiment, after the offline processing is performed on the target disk whose attribute state is the offline state by the disk processing program, the method further includes:

step S602, removing the target disk in the offline state from the distributed storage system.

In an alternative embodiment, the disk processing program may remove the target disk whose attribute state is the offline state from the distributed storage system once it has confirmed that the offline task of the target disk can be processed. Optionally, it may also change the state of the offline task of the target disk to a target state, for example a completed state, which indicates that the offline task is completed.

As an alternative embodiment, after the target disk in the offline state is removed from the distributed storage system, the disk processing program also generates and initiates a work order notifying a worker that the target disk may be pulled out at any time, so that the target disk can be repaired.

An optional implementation is described below to explain the method for processing a failed disk in a distributed storage system provided by the embodiments of the present application and to facilitate understanding of those embodiments. Fig. 6 is a flowchart of an optional method for processing a failed disk in a distributed storage system according to an embodiment of the present invention. As shown in fig. 6, the method may be implemented by, but is not limited to, the following method steps:

Step S60, determining a failure determination rule for detecting a disk failure.

It should be noted that, the execution body of the embodiment of the present application may be, but is not limited to, a distributed storage system, and is configured to process a failed disk in the distributed storage system, so as to achieve the purpose of ensuring data security when the disk in the distributed storage system is failed.

In an alternative embodiment, the rule is a failure determination rule predefined by the distributed storage system for detecting a disk failure, where the failure determination rule includes at least one of: the state of data reading and writing on the disk, and the working state of each disk as scanned by the disk detection program.

Step S61, the working state of each disk is scanned, and whether each disk has faults or not is determined according to the fault judging rules.

For example, by detecting the status of data reading and writing on the disk, it is determined whether data can be successfully read and written, whether there is a read failure or a write failure, etc.; and determining whether each disk has faults according to the working states of each disk scanned by a disk detection program running in the distributed storage system and the fault judgment rule.
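Steps S60 and S61 can be sketched as follows. The record type `DiskStatus`, its fields, and the state strings `"normal"` and `"abnormal"` are hypothetical; the sketch only assumes that a fault is declared when a read fails, a write fails, or the scanned working state is not normal.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DiskStatus:
    disk_id: str
    read_ok: bool     # whether data can be read successfully
    write_ok: bool    # whether data can be written successfully
    scan_state: str   # working state reported by a detection scan (assumed values)


def has_fault(status: DiskStatus) -> bool:
    """Apply the assumed fault judging rule to one disk."""
    return (not status.read_ok) or (not status.write_ok) or status.scan_state != "normal"


def find_faulty(statuses: Iterable[DiskStatus]) -> List[str]:
    """Scan every disk and return the identifiers of the faulty ones."""
    return [s.disk_id for s in statuses if has_fault(s)]
```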

Step S62, under the condition that the fault of a target disk under the distributed storage system is detected, reporting the fault event of the target disk to a configuration management database;

Step S63, generating an offline task of the target disk according to the event, and issuing the offline task.

In the above alternative embodiment, a disk detection program running in the distributed storage system is used to detect whether the target disk has a fault. For example, the disk detection program may, but is not limited to, continuously scan the state of each disk and, when it detects that the state of the target disk is abnormal, report an event that the target disk has a fault to the configuration management database, so that the configuration management database records the event.
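Steps S62 and S63 can be sketched with an in-memory stand-in for the configuration management database. The classes `ConfigDB` and `OfflineTask`, the method `report_fault`, and the `"pending"` task state are hypothetical names chosen for illustration; a real configuration management database would expose its own interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OfflineTask:
    disk_id: str
    state: str = "pending"   # assumed initial state of a newly issued task


@dataclass
class ConfigDB:
    """In-memory stand-in for the configuration management database."""
    events: List[dict] = field(default_factory=list)
    tasks: List[OfflineTask] = field(default_factory=list)

    def report_fault(self, disk_id: str, reason: str) -> OfflineTask:
        # Record the fault event, then generate and issue the offline task.
        self.events.append({"disk_id": disk_id, "reason": reason})
        task = OfflineTask(disk_id)
        self.tasks.append(task)
        return task
```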

Step S64, determining the processing priority of the offline task.

In the embodiment of the present application, the tasks currently to be processed may be obtained from the configuration management database by, but not limited to, an approval program running in the distributed storage system, and the processing priority of each task may be determined.
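One way to process offline tasks in priority order is a priority queue, sketched below. The class `TaskQueue` and the convention that a smaller number means earlier processing are assumptions; this application does not prescribe how the priority value is represented.

```python
import heapq
import itertools


class TaskQueue:
    """Pending offline tasks ordered by processing priority."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: equal priorities keep arrival order

    def push(self, disk_id: str, priority: int) -> None:
        # Assumption: a smaller priority number is processed earlier.
        heapq.heappush(self._heap, (priority, next(self._counter), disk_id))

    def pop(self) -> str:
        _, _, disk_id = heapq.heappop(self._heap)
        return disk_id

    def __len__(self) -> int:
        return len(self._heap)
```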

Step S65, when the offline task is processed, determining the redundancy level of the target disk.

Step S66, if the redundancy level of the target disk meets the offline condition, the target disk is processed offline by a disk processing program.

Step S67, the target disk in the offline state is removed from the distributed storage system.
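Steps S65 to S67 can be sketched as follows. The sketch assumes that the redundancy level is checked per data chunk and that the offline condition is met when every chunk stored on the target disk still has at least one copy on another healthy disk; the function names, the `chunk_replicas` mapping and the state strings are hypothetical.

```python
from typing import Dict, Set


def meets_offline_condition(chunk_replicas: Dict[str, Set[str]], target: str,
                            healthy: Set[str], min_survivors: int = 1) -> bool:
    """True if reads and writes stay possible after the target goes offline."""
    for disks in chunk_replicas.values():
        if target not in disks:
            continue  # this chunk is not affected by the target disk
        survivors = (disks - {target}) & healthy
        if len(survivors) < min_survivors:
            return False
    return True


def process_offline_task(cluster: Set[str], states: Dict[str, str],
                         chunk_replicas: Dict[str, Set[str]], target: str) -> bool:
    """Check redundancy, mark the disk offline, then remove it from the cluster."""
    healthy = {d for d in cluster if states.get(d) == "online"}
    if not meets_offline_condition(chunk_replicas, target, healthy):
        return False              # redundancy too low, leave the disk in place
    states[target] = "offline"    # step S66: attribute state becomes offline
    cluster.discard(target)       # step S67: remove the disk from the system
    return True
```

When the condition is not met, the sketch simply returns without changing anything, which leaves room for a backup to raise the redundancy level before the task is retried.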

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the above-mentioned methods of the various embodiments of the present invention.

Example 2

According to an embodiment of the present invention, there is further provided an embodiment of an apparatus for implementing the method for processing a failed disk in a distributed storage system, and fig. 7 is a schematic diagram of an apparatus for processing a failed disk in a distributed storage system according to an embodiment of the present invention, as shown in fig. 7, where the apparatus 700 includes: a generation module 702, a first determination module 704, a second determination module 706, and a processing module 708, wherein:

a generating module 702, configured to generate an offline task for a target disk in a distributed storage system when detecting that the target disk has a failure; a first determining module 704, configured to determine a processing priority of the offline task; a second determining module 706, configured to determine a redundancy level of the target disk when it is time to process the offline task; and a processing module 708, configured to perform offline processing on the target disk through a disk processing program if the redundancy level of the target disk meets an offline condition.
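The cooperation of the four modules can be sketched as a single object that calls them in order. The class `FaultDiskApparatus`, the callable signatures and the `min_redundancy` threshold are assumptions for illustration; the reference numerals in the comments map each field to the module it stands for.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FaultDiskApparatus:
    generate_task: Callable[[str], dict]         # generating module 702
    determine_priority: Callable[[dict], int]    # first determining module 704
    determine_redundancy: Callable[[str], int]   # second determining module 706
    take_offline: Callable[[str], None]          # processing module 708
    min_redundancy: int = 1                      # assumed offline condition

    def handle_fault(self, disk_id: str) -> bool:
        """Run the four modules in sequence for one faulty disk."""
        task = self.generate_task(disk_id)
        task["priority"] = self.determine_priority(task)
        if self.determine_redundancy(disk_id) >= self.min_redundancy:
            self.take_offline(disk_id)
            return True
        return False
```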

Here, the generating module 702, the first determining module 704, the second determining module 706, and the processing module 708 correspond to steps S202 to S208 in embodiment 1; the examples and application scenarios implemented by the four modules are the same as those of the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should be noted that the above modules may run as part of the apparatus in the computer terminal 10 provided in embodiment 1.

In addition, it should be noted that for optional or preferred implementations of this embodiment, reference may be made to the related description in embodiment 1, which is not repeated here.

Example 3

According to an embodiment of the present invention, there is further provided an embodiment of a computer system configured to perform any one of the optional or preferred methods for processing a failed disk in a distributed storage system of embodiment 1, where the computer system includes: a processor and a memory, wherein:

a processor; and a memory, coupled to the processor, for providing the processor with instructions for performing the following processing steps: generating an offline task for a target disk in a distributed storage system under the condition that a fault of the target disk is detected; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

Example 4

Embodiments of the present invention may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-described computer terminal may be replaced with a terminal device such as a mobile terminal.

Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.

In this embodiment, the computer terminal may execute program code for the following steps of the method for processing a failed disk in a distributed storage system: generating an offline task for a target disk in a distributed storage system under the condition that a fault of the target disk is detected; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

Alternatively, fig. 8 is a block diagram of a computer terminal according to an embodiment of the present invention, and as shown in fig. 8, the computer terminal 800 may include: one or more (only one is shown) processors 802, memory 804, and a peripheral interface 806.

The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for processing a failed disk in a distributed storage system in the embodiment of the present invention, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, implements the method for processing a failed disk in a distributed storage system. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, which may be connected to the computer terminal 800 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: generating an offline task for a target disk in a distributed storage system under the condition that a fault of the target disk is detected; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

Optionally, the above processor may further execute program code for: and scanning the working state of each disk, and determining whether each disk has a fault according to a fault judging rule.

Optionally, the above processor may further execute program code for: reporting the event that the target disk has faults to a configuration management database; and generating the offline task of the target disk according to the event, and issuing the offline task.

Optionally, the above processor may further execute program code for: determining a backup strategy according to the data influence range of the target disk after offline; and backing up the data information of the target disk to the disk except the target disk in the distributed storage system by adopting the backup strategy.

Optionally, the above processor may further execute program code for: determining a task to be processed in a task operation room under the distributed storage system; determining the data influence range of the target disk after offline; determining the backup state of the target disk; and determining the processing priority of the offline task based on the task to be processed, the data influence range of the target disk and the backup state of the target disk.

Optionally, the above processor may further execute program code for: under the condition that data reading and writing will not be affected after the target disk is offline, determining that the redundancy level of the target disk meets the offline condition; modifying the attribute state of the target disk to an offline state, and performing offline processing, through a disk processing program, on the target disk whose attribute state is the offline state.

Optionally, the above processor may further execute program code for: if the fault monitoring module in the distributed storage system monitors that the target disk has faults, backing up the data information of the target disk by adopting a disk except the target disk, or backing up the data information of the target disk in the disk except the target disk in the distributed storage system under the condition that the probability that the target disk cannot work normally is larger than a preset probability.

Optionally, the above processor may further execute program code for: and removing the target disk in the offline state from the distributed storage system.
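The backup trigger listed among the optional program code above (a fault reported by the fault monitoring module, or a failure probability greater than a preset probability) can be sketched as follows. The function names and the default threshold of 0.8 are hypothetical; the preset probability is left to the implementer.

```python
from typing import List


def should_back_up(fault_detected: bool, failure_probability: float,
                   preset_probability: float = 0.8) -> bool:
    """Back up when a fault is monitored, or when failure looks likely enough.

    Assumption: 0.8 is only a placeholder for the preset probability.
    """
    return fault_detected or failure_probability > preset_probability


def backup_destinations(target: str, disks: List[str]) -> List[str]:
    """Candidate disks for the backup: every disk except the target disk."""
    return [d for d in disks if d != target]
```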

The embodiment of the invention provides a scheme for processing a fault disk in a distributed storage system. An offline task is generated for a target disk in a distributed storage system under the condition that a fault of the target disk is detected; the processing priority of the offline task is determined; the redundancy level of the target disk is determined when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, the target disk is processed offline through a disk processing program. This achieves the aim of ensuring data security during disk fault processing in a distributed storage system, and solves the technical problem that data security is difficult to ensure during disk fault processing in a distributed storage system in the prior art.

It will be appreciated by those skilled in the art that the configuration shown in fig. 8 is only illustrative, and the computer terminal may also be a terminal device such as a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 8 does not limit the structure of the above electronic device. For example, the computer terminal 800 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 8, or have a different configuration than shown in fig. 8.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device, the program may be stored in a computer readable storage medium, and the storage medium may include: a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, and the like.

Example 5

The embodiment of the invention also provides a storage medium. Alternatively, in this embodiment, the storage medium may be used to store program codes executed by the method for processing a failed disk in the distributed storage system provided in embodiment 1.

Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: generating an offline task for a target disk in a distributed storage system under the condition that a fault of the target disk is detected; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: and scanning the working state of each disk, and determining whether each disk has a fault according to a fault judging rule.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: reporting the event that the target disk has faults to a configuration management database; and generating the offline task of the target disk according to the event, and issuing the offline task.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: determining a backup strategy according to the data influence range of the target disk after offline; and backing up the data information of the target disk to the disk except the target disk in the distributed storage system by adopting the backup strategy.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: determining a task to be processed in a task operation room under the distributed storage system; determining the data influence range of the target disk after offline; determining the backup state of the target disk; and determining the processing priority of the offline task based on the task to be processed, the data influence range of the target disk and the backup state of the target disk.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: under the condition that data reading and writing will not be affected after the target disk is offline, determining that the redundancy level of the target disk meets the offline condition; modifying the attribute state of the target disk to an offline state, and performing offline processing, through a disk processing program, on the target disk whose attribute state is the offline state.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: if the fault monitoring module in the distributed storage system monitors that the target disk has faults, backing up the data information of the target disk by adopting a disk except the target disk, or backing up the data information of the target disk in the disk except the target disk in the distributed storage system under the condition that the probability that the target disk cannot work normally is larger than a preset probability.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: and removing the target disk in the offline state from the distributed storage system.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

In the foregoing embodiments of the present invention, the description of each embodiment has its own emphasis; for a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary; for example, the division of the units is merely a logical function division and may be implemented in another manner in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed between the parts may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part of it contributing to the prior art, or all or part of it, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of protection of the present invention.

Claims

1. A method for processing a fault disk in a distributed storage system comprises the following steps:

generating an offline task for a target disk in a distributed storage system under the condition that a fault of the target disk is detected;

determining the processing priority of the offline task;

determining a redundancy level of the target disk when the offline task is processed based on the processing priority;

if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program;

under the condition that data reading and writing cannot be affected after the target disk is offline, determining that the redundancy level of the target disk meets the offline condition; the condition that the data read-write cannot be affected after the target disk is offline includes: if the fault monitoring module in the distributed storage system monitors that the target disk has faults, backing up the data information of the target disk by adopting a disk except the target disk, or backing up the data information of the target disk to the disk except the target disk in the distributed storage system under the condition that the probability that the target disk cannot work normally is larger than the preset probability.

2. The processing method of claim 1, wherein prior to detecting that a target disk under the distributed storage system is faulty, the method further comprises:

3. The processing method of claim 1, wherein generating the offline task for the target disk under the distributed storage system if a failure of the target disk is detected comprises:

reporting an event that the target disk has a fault to a configuration management database;

and generating an offline task of the target disk according to the event, and issuing the offline task.

4. The processing method according to claim 1, wherein, in a case where a failure of a target disk under a distributed storage system is detected, after generating an offline task for the target disk, the method further comprises:

determining a backup strategy according to the data influence range of the target disk after offline;

and backing up the data information of the target disk to the disk except the target disk in the distributed storage system by adopting the backup strategy.

5. The processing method of claim 1, wherein determining the processing priority of the offline task comprises:

determining a task to be processed in a task operation room under the distributed storage system;

determining the data influence range of the target disk after offline;

determining the backup state of the target disk;

and determining the processing priority of the offline task based on the task to be processed, the data influence range of the target disk and the backup state of the target disk.

6. The processing method according to claim 1, wherein if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk by a disk processing program includes:

modifying the attribute state of the target disk to an offline state, and performing offline processing, through a disk processing program, on the target disk whose attribute state is the offline state.

7. The processing method according to claim 1, wherein after the target disk whose attribute state is the offline state is subjected to the offline processing by the disk processing program, the method further comprises:

and removing the target disk in the offline state from the distributed storage system.

8. A device for handling failed disks in a distributed storage system, comprising:

a generation module, configured to generate an offline task for a target disk in a distributed storage system under the condition that a fault of the target disk is detected;

the first determining module is used for determining the processing priority of the offline task;

the second determining module is used for determining the redundancy level of the target disk when the offline task is processed based on the processing priority;

the processing module is used for carrying out offline processing on the target disk through a disk processing program if the redundancy level of the target disk meets the offline condition;

the second determining module is further configured to determine that the redundancy level of the target disk meets an offline condition under the condition that data reading and writing cannot be affected after the target disk is offline; the condition that the data read-write cannot be affected after the target disk is offline includes: if the fault monitoring module in the distributed storage system monitors that the target disk has faults, backing up the data information of the target disk by adopting a disk except the target disk, or backing up the data information of the target disk to the disk except the target disk in the distributed storage system under the condition that the probability that the target disk cannot work normally is larger than the preset probability.

9. A storage medium comprising a stored program, wherein the program, when run, controls a device on which the storage medium resides to perform the steps of: generating an offline task for a target disk in a distributed storage system under the condition that a fault of the target disk is detected; determining the processing priority of the offline task; determining a redundancy level of the target disk when the offline task is processed based on the processing priority; if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program; under the condition that data reading and writing cannot be affected after the target disk is offline, determining that the redundancy level of the target disk meets the offline condition; the condition that the data read-write cannot be affected after the target disk is offline includes: if the fault monitoring module in the distributed storage system monitors that the target disk has faults, backing up the data information of the target disk by adopting a disk except the target disk, or backing up the data information of the target disk to the disk except the target disk in the distributed storage system under the condition that the probability that the target disk cannot work normally is larger than the preset probability.

10. A computer system, comprising:

a processor; and

a memory, coupled to the processor, for providing the processor with instructions for performing the following processing steps: generating an offline task for a target disk in a distributed storage system under the condition that a fault of the target disk is detected; determining the processing priority of the offline task; determining a redundancy level of the target disk when the offline task is processed based on the processing priority; if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program; under the condition that data reading and writing cannot be affected after the target disk is offline, determining that the redundancy level of the target disk meets the offline condition; the condition that the data read-write cannot be affected after the target disk is offline includes: if the fault monitoring module in the distributed storage system monitors that the target disk has faults, backing up the data information of the target disk by adopting a disk except the target disk, or backing up the data information of the target disk to the disk except the target disk in the distributed storage system under the condition that the probability that the target disk cannot work normally is larger than the preset probability.