CN110968456A - Method and device for processing fault disk in distributed storage system

Info

Publication number: CN110968456A
Application number: CN201811156593.5A
Authority: CN
Inventors: 王勇; 王鹏; 闫宁; 林江彬
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-09-30
Filing date: 2018-09-30
Publication date: 2020-04-07
Anticipated expiration: 2038-09-30
Also published as: CN110968456B

Abstract

The invention discloses a method and a device for processing a failed disk in a distributed storage system. The method comprises: when a fault is detected on a target disk in the distributed storage system, generating an offline task for the target disk; determining the processing priority of the offline task; when the offline task comes up for processing according to its processing priority, determining the redundancy level of the target disk; and, if the redundancy level of the target disk meets the offline condition, taking the target disk offline through a disk processing program. The invention solves the technical problem in the prior art that data security is difficult to ensure when disk failures in a distributed storage system are handled.

Description

Method and device for processing fault disk in distributed storage system

Technical Field

The invention relates to the field of fault processing of a distributed storage system, in particular to a method and a device for processing a fault disk in the distributed storage system.

Background

When a distributed storage system is deployed at large scale, even low failure rates (for example, an annual failure rate of 1%-2% for disks and 7% for machines) mean that a considerable number of disks and machines may fail every day. These faulty devices reduce the reliability of the system on the one hand and its available resources on the other, which also wastes cost.
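The scale of the problem can be illustrated with simple arithmetic. The following is a minimal sketch: the annual failure rates are the ones quoted above, while the fleet sizes (one million disks, fifty thousand machines) and the assumption that failures are spread evenly over a 365-day year are hypothetical and chosen only for illustration.

```python
def expected_daily_failures(units: int, annual_failure_rate: float) -> float:
    """Average number of units failing per day, assuming failures are
    spread evenly over a 365-day year (an illustrative simplification)."""
    return units * annual_failure_rate / 365


# Hypothetical fleet sizes; only the rates come from the text above.
disks_low = expected_daily_failures(1_000_000, 0.01)   # about 27 disks per day
disks_high = expected_daily_failures(1_000_000, 0.02)  # about 55 disks per day
machines = expected_daily_failures(50_000, 0.07)       # about 10 machines per day
```

Under these assumed fleet sizes, dozens of disks would need handling every day, which is why per-disk handling matters.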

To solve this problem, the related art provides a state-machine-based method for handling failures automatically. However, that method simply treats a disk failure as a machine failure and responds by reformatting the machine, reinstalling its operating system, or replacing the entire machine. Such handling is too coarse-grained for a machine that holds dozens of large-capacity disks: it can trigger massive data copying and severely affect the system. Moreover, the scheme does not address the two key problems of disk handling, namely data backup and data security control.

In addition, the FBAR system in the prior art is a workflow system and operates as a single, standalone system. For example, when a machine fails, the FBAR system may mark the machine as "to be repaired", after which a repair process performs the subsequent operations. The FBAR system does not consider concurrency with other operation and maintenance operations, which easily leads to data availability problems and inefficient concurrent processing. Nor does the FBAR system solve the problem of how to guarantee data security in the storage system.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the invention provides a method and a device for processing a fault disk in a distributed storage system, which are used for at least solving the technical problem that the data security is difficult to ensure when the disk fault in the distributed storage system is processed in the prior art.

According to an aspect of the embodiments of the present invention, a method for processing a failed disk in a distributed storage system is provided, including: under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

According to another aspect of the embodiments of the present invention, there is further provided a device for processing a failed disk in a distributed storage system, including: the generating module is used for generating an offline task aiming at a target disk in the distributed storage system under the condition that the target disk is detected to have a fault; the first determining module is used for determining the processing priority of the offline task; a second determining module, configured to determine a redundancy level of the target disk when it is time to process the offline task; and the processing module is used for performing offline processing on the target disk through a disk processing program if the redundancy level of the target disk meets the offline condition.

According to another aspect of the embodiments of the present invention, there is also provided a storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus where the storage medium is located is controlled to perform the following steps: under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

According to another aspect of the embodiments of the present invention, there is also provided a computer system including: a processor; and a memory, connected to the processor, for providing instructions to the processor for processing the following processing steps: under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

In the embodiment of the invention, the offline task aiming at the target disk is generated under the condition that the target disk in the distributed storage system is detected to have a fault; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

Based on this method for processing a failed disk in a distributed storage system, the failed disk is taken offline in a disk-oriented manner. Because each processing step is driven by the state of the disk, errors can be avoided when the failed disk is taken offline; this ensures the robustness of the distributed storage system, makes the processing re-entrant, and guarantees the correctness of the processing logic.

Therefore, the purpose of ensuring data security when disk faults are processed in the distributed storage system is achieved, the technical effect of improving the efficiency of offline processing of the fault disk is achieved, and the technical problem that data security is difficult to ensure when the disk faults are processed in the distributed storage system in the prior art is solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a block diagram of a hardware structure of a computer terminal (or a mobile device) for implementing a method for processing a failed disk in a distributed storage system according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for handling a failed disk in a distributed storage system according to an embodiment of the present invention;

FIG. 3 is a flow chart of an alternative method for handling a failed disk in a distributed storage system according to an embodiment of the present invention;

FIG. 4 is a flow chart of an alternative method for handling a failed disk in a distributed storage system according to an embodiment of the present invention;

FIG. 5 is a flow chart of an alternative method for handling a failed disk in a distributed storage system according to an embodiment of the present invention;

FIG. 6 is a flow chart of an alternative method for handling a failed disk in a distributed storage system according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a processing device for a failed disk in a distributed storage system according to an embodiment of the present invention; and

FIG. 8 is a block diagram of a computer terminal according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, some of the terms appearing in the description of the embodiments of the present application are explained as follows:

Configuration management database (CMDB): a system used to store and manage the configuration information of devices in an enterprise IT architecture. It is closely connected with all service support and service delivery processes, supports the operation of those processes, and realizes the value of the configuration information, while relying on the related processes to ensure the accuracy of its data.

Storage node: a machine for storing application data, which generally includes several storage media.

Distributed storage system: refers to a storage system comprising several storage nodes, typically employing distributed algorithms to provide high availability, fault tolerance and high performance.

Failure: an exception condition that may cause the distributed storage system or its functionality to fail.

Event: significant changes in disk state that can be recognized by the distributed system.

Task: may be used to describe transactional work that needs to be processed.

Example 1

According to an embodiment of the present invention, a method embodiment of a method for handling a failed disk in a distributed storage system is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.

The method provided by Embodiment 1 of the present application can be executed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 shows a hardware configuration block diagram of a computer terminal (or mobile device) for implementing the method for processing a failed disk in a distributed storage system. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).

The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the processing method of a failed disk in the distributed storage system in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, the processing method of a failed disk in the distributed storage system implementing the application programs described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).

In the foregoing operating environment, the present application provides a method for processing a failed disk in a distributed storage system as shown in fig. 2, where fig. 2 is a flowchart of a method for processing a failed disk in a distributed storage system according to an embodiment of the present invention, and as shown in fig. 2, the method includes the following method steps:

step S202, in case that a failure of a target disk in the distributed storage system is detected, generating an offline task for the target disk.

In an alternative embodiment, a failure of the target disk may include, but is not limited to, any one of the following: intermittent failure, in which reads and writes cannot be completed successfully; media damage, in which, for example, one or more bits are permanently damaged and cannot be read correctly; write failure, in which a sector being written can neither be written correctly nor have its previously written contents retrieved, for example because power is lost while the sector is being written; and disk crash, in which the entire disk is permanently unreadable.

It should be noted that, in the embodiment of the present application, the target state of each component may be recorded by a configuration management database in the distributed storage system, and a disk detection program running in the distributed storage system is used to detect whether the target disk is faulty. For example, the disk detection program may, but is not limited to, continuously scan the state of each target disk and, when it detects that the state of a target disk is abnormal, report the event that the target disk is faulty to the configuration management database. The configuration management database records the event, and an event dispatch program running in the distributed storage system generates an offline task for the target disk according to the event.
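The detection, recording, and dispatch flow described here can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class and function names (`ConfigDB`, `report_fault`, `dispatch_events`), the in-memory lists standing in for the configuration management database, and the rule of creating at most one offline task per disk are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    disk_id: str
    kind: str  # hypothetical event type label, e.g. "disk_fault"


@dataclass
class OfflineTask:
    disk_id: str
    state: str = "pending"  # hypothetical initial task state


@dataclass
class ConfigDB:
    """In-memory stand-in for the configuration management database."""
    events: List[Event] = field(default_factory=list)
    tasks: List[OfflineTask] = field(default_factory=list)


def report_fault(cmdb: ConfigDB, disk_id: str) -> Event:
    """Disk detection program: record a fault event in the database."""
    event = Event(disk_id, "disk_fault")
    cmdb.events.append(event)
    return event


def dispatch_events(cmdb: ConfigDB) -> List[OfflineTask]:
    """Event dispatch program: create one offline task for each faulty
    disk that does not have a task yet (deduplication is an assumption)."""
    known = {task.disk_id for task in cmdb.tasks}
    created = []
    for event in cmdb.events:
        if event.kind == "disk_fault" and event.disk_id not in known:
            task = OfflineTask(event.disk_id)
            cmdb.tasks.append(task)
            known.add(event.disk_id)
            created.append(task)
    return created
```

Because the dispatcher only looks at what is recorded in the database, running it again after a restart creates no duplicate tasks in this sketch.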

Step S204, determining the processing priority of the offline task.

In the embodiment of the present application, the processing priority of the offline task may be determined, but is not limited to, according to an approval procedure running in the distributed storage system.

As an optional embodiment, the approval program is responsible for approving operation and maintenance operations; it may continuously obtain the current tasks to be processed from the configuration management database and determine the processing priority of each task.

When the task acquired by the approval program is an offline task of the target disk, the approval program may, but is not limited to, comprehensively consider the processing priority of each current task to be processed, and determine the sequence of processing each task and the processing time for processing each task according to the processing priority.

Step S206, when it is time to process the offline task, determining the redundancy level of the target disk.

It should be noted that, when it is time to process the offline task, the approval program may determine whether the redundancy level of the data information of the target disk meets the offline condition by checking whether that data information has been backed up. If the redundancy level meets the offline condition, the data information of the target disk has been backed up, and reading and writing of that data information will not be affected after the target disk goes offline.

Step S208, if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

In the embodiment of the application, when the redundancy level of the target disk meets the offline condition, the approval program may modify the local attribute state of the target disk to the offline state, so that the target disk is not used by mistake even if a program is restarted; at the same time, the approval program modifies the state of the offline task of the target disk in the configuration management database to the offline state.

Furthermore, when the disk processing program running in the distributed storage system confirms that the offline task of the target disk can be processed, it may remove the target disk whose attribute state is the offline state from the distributed storage system and change the state of the offline task of the target disk to the completed state. At the same time, the disk processing program generates and initiates a work order to notify a worker that the target disk can be pulled out at any time and repaired.

It should be noted that each program running in the distributed storage system in the embodiment of the present application may, but is not limited to, act independently and only according to whether the target disk is in a given state; when any program is restarted after an exception, it can resume processing at any time according to the current state of the target disk.
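The state-driven, restart-safe behavior described here can be sketched as follows. This is an illustrative sketch only: the state names in `DiskState`, the dictionary layout of a disk record, and the two handler functions are hypothetical, chosen to show that each program acts solely on the current state and is therefore safe to re-run.

```python
from enum import Enum


class DiskState(Enum):
    FAULTY = "faulty"
    OFFLINE = "offline"
    REMOVED = "removed"


def approval_step(disk: dict) -> None:
    """Approval program: acts only on FAULTY disks whose data is redundant."""
    if disk["state"] is DiskState.FAULTY and disk["redundant"]:
        disk["state"] = DiskState.OFFLINE


def disk_handler_step(disk: dict) -> None:
    """Disk processing program: acts only on disks already marked OFFLINE."""
    if disk["state"] is DiskState.OFFLINE:
        disk["state"] = DiskState.REMOVED


disk = {"state": DiskState.FAULTY, "redundant": True}
approval_step(disk)
approval_step(disk)      # re-running after a restart changes nothing further
disk_handler_step(disk)
disk_handler_step(disk)  # likewise a no-op once the disk is REMOVED
```

Since every handler checks the recorded state before acting, a program that crashes between steps can simply be started again and will pick up where the state says it should.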

In an alternative embodiment, before detecting that the target disk under the distributed storage system has a failure, the method further includes the following method steps:

and scanning the working state of each disk, and determining whether each disk has a fault according to a fault judgment rule.

As an optional embodiment, the distributed storage system in the embodiment of the present application may predefine a set of failure determination rules for detecting disk failures, for example, checking the state of data reads and writes on a disk, determining whether data can be read and written successfully, and determining whether a read failure or a write failure exists. A disk detection program running in the distributed storage system scans the working state of each disk and determines, according to the failure determination rules, whether each disk has a fault.
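A predefined rule set of this kind might be represented as a table of named predicates. The sketch below is an assumption-laden illustration: the rule names, the `read_ok` and `write_ok` status keys, and the dictionary form of a scanned disk status are hypothetical and do not come from the source.

```python
from typing import Callable, Dict, List

# Each rule inspects a scanned disk status and returns True when the
# status indicates a fault. Rule names and status keys are hypothetical.
Rule = Callable[[Dict[str, bool]], bool]

FAULT_RULES: Dict[str, Rule] = {
    "read_failure": lambda status: not status.get("read_ok", True),
    "write_failure": lambda status: not status.get("write_ok", True),
}


def matched_rules(status: Dict[str, bool]) -> List[str]:
    """Names of all failure determination rules the scanned status violates."""
    return [name for name, rule in FAULT_RULES.items() if rule(status)]


def is_faulty(status: Dict[str, bool]) -> bool:
    """A disk is considered faulty if at least one rule matches."""
    return bool(matched_rules(status))
```

Keeping the rules in a table lets new checks be added without changing the scanning loop.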

In an optional embodiment, in step S202, in the case that it is detected that a target disk in the distributed storage system has a failure, generating an offline task for the target disk may be implemented by the following steps:

step S2020, reporting the event that the target disk has the fault to a configuration management database;

step S2022, generating an offline task of the target disk according to the event, and issuing the offline task.

In an optional embodiment, a disk detection program running in the distributed storage system may be used, but is not limited to being used, to detect whether the target disk has a failure. When it detects that the target disk has a failure, it reports the corresponding event to the configuration management database in the distributed storage system, which records the event; an event dispatch program running in the distributed storage system then generates an offline task for the target disk according to the event and issues the offline task.

In an alternative embodiment, fig. 3 is a flowchart of a processing method for a failed disk in an alternative distributed storage system according to an embodiment of the present invention, and as shown in fig. 3, the step S204 for determining the processing priority of the offline task may be implemented by the following steps:

step S302, determining a task to be processed in a task operation and maintenance operation room under the distributed storage system;

step S304, determining the data influence range after the target disk is offline;

step S306, determining the backup state of the target disk;

step S308, determining the processing priority of the offline task based on the task to be processed, the data influence range of the target disk, and the backup state of the target disk.

In this embodiment, an approval program running in the distributed storage system may, but is not limited to, obtain the current tasks to be processed from the task operation and maintenance operation room in the distributed storage system, determine the data influence range after the target disk goes offline and the backup state of the target disk, and determine the processing priority of the offline task by jointly considering the tasks to be processed, the data influence range of the target disk, and the backup state of the target disk.
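The source does not specify how the three inputs are combined, so the following is only one possible sketch. The ordering used here (backed-up disks first, then smaller data influence range, then fewer competing tasks) and all field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OfflineRequest:
    disk_id: str
    pending_tasks_on_node: int  # other O&M tasks competing for the same node
    affected_fraction: float    # data influence range, from 0.0 to 1.0
    backed_up: bool             # backup state of the target disk


def priority_key(req: OfflineRequest) -> Tuple[bool, float, int]:
    """Sort key; smaller tuples are processed first.

    Assumed ordering: disks that are already backed up come first, then
    smaller influence range, then fewer competing pending tasks.
    """
    return (not req.backed_up, req.affected_fraction, req.pending_tasks_on_node)


def order_requests(requests: List[OfflineRequest]) -> List[OfflineRequest]:
    """Return the offline requests in the order they would be processed."""
    return sorted(requests, key=priority_key)
```

A different deployment could weight these factors differently; the point of the sketch is only that all three inputs feed a single ordering.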

In an optional embodiment, FIG. 4 is a flowchart of a processing method of a failed disk in an optional distributed storage system according to an embodiment of the present invention, and as shown in FIG. 4, if the redundancy level of the target disk meets an offline condition, the offline processing on the target disk by a disk processing program includes:

step S402, determining that the redundancy level of the target disk meets the offline condition under the condition that the reading and writing of data are not affected after the target disk is offline;

step S404, modifying the attribute status of the target disk into an offline status, and performing offline processing on the target disk with the attribute status being the offline status through the disk processing program.

In an optional embodiment, if the data reading and writing are not affected after the target disk is offline, for example, the data information of the target disk is backed up, it is determined that the redundancy level of the data information contained in the target disk meets the offline condition; and if the data reading and writing are influenced after the target disk is offline, determining that the redundancy level of the data information contained in the target disk does not meet the offline condition.
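One way to express this check is to require that every piece of data on the target disk also exists on some other disk. The sketch below assumes a hypothetical chunk-to-disk placement map and a hypothetical threshold parameter `min_other_replicas`; the source does not define how redundancy is measured.

```python
from typing import Dict, Set


def meets_offline_condition(
    target: str,
    chunk_locations: Dict[str, Set[str]],
    min_other_replicas: int = 1,
) -> bool:
    """Return True if every chunk stored on `target` has at least
    `min_other_replicas` copies on other disks, so that reads and writes
    remain possible after the target disk goes offline."""
    for disks in chunk_locations.values():
        if target in disks and len(disks - {target}) < min_other_replicas:
            return False
    return True


# Hypothetical placement: chunk "c2" exists only on disk "d1".
locations = {"c1": {"d1", "d2"}, "c2": {"d1"}, "c3": {"d3"}}
can_offline_d1 = meets_offline_condition("d1", locations)  # False
can_offline_d2 = meets_offline_condition("d2", locations)  # True
```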

In an optional embodiment, the condition that reading and writing of data is not affected after the target disk is offline may be implemented by, but not limited to, the following steps:

step S2062, if the failure monitoring module in the distributed storage system monitors that the target disk fails, the data information of the target disk is backed up by using the disks other than the target disk, or,

step S2064, in a case that the probability that the target disk cannot normally operate is greater than the preset probability, backing up the data information of the target disk to disks in the distributed storage system other than the target disk.

In an optional embodiment, it may be determined, but not limited to, that data reading and writing are not affected after the target disk is offline by at least one of the following two ways:

In the first way, when a failure monitoring module in the distributed storage system detects that the target disk has failed, disks other than the target disk are used to back up the data information of the target disk. It can then be determined that the redundancy level of the data information of the target disk meets the offline condition and that data reading and writing will not be affected after the target disk goes offline, so the target disk is taken offline directly.

In the second way, the target disk is still working normally but carries a potential risk of failing. For example, when the probability that the target disk cannot work normally is greater than a preset probability, the approval program may proactively back up the data information in the target disk to other storage nodes, for example to disks in the distributed storage system other than the target disk, so as to ensure that the redundancy level of the data information of the target disk meets the offline condition.
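The proactive path can be sketched as a threshold test followed by the choice of a backup destination. The failure-probability estimates, the threshold value, and the "most free space" selection rule below are hypothetical; the source only states that a backup is made to disks other than the target disk when the probability exceeds a preset value.

```python
from typing import Dict, List


def disks_needing_proactive_backup(
    failure_probability: Dict[str, float], threshold: float
) -> List[str]:
    """Disks whose estimated probability of not working normally is
    strictly greater than the preset threshold."""
    return sorted(d for d, p in failure_probability.items() if p > threshold)


def pick_backup_target(source: str, free_space: Dict[str, int]) -> str:
    """Choose a disk other than `source`; picking the one with the most
    free space is an assumed policy, not one given in the source."""
    candidates = {d: s for d, s in free_space.items() if d != source}
    if not candidates:
        raise ValueError("no other disk available for backup")
    return max(candidates, key=candidates.get)


at_risk = disks_needing_proactive_backup({"d1": 0.9, "d2": 0.1, "d3": 0.6}, 0.5)
destination = pick_backup_target("d1", {"d1": 100, "d2": 50, "d3": 70})
```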

Based on the alternative provided by the embodiment of the application, the approval program checks that data reading and writing will not be affected after the target disk goes offline: if the redundancy level of the data information of the target disk meets the offline condition, taking the target disk offline does not affect reads and writes of that data, which achieves the technical effect that reading and writing of the target disk's data information are unaffected after the target disk goes offline.

In an optional embodiment, FIG. 5 is a flowchart of a processing method for a failed disk in an optional distributed storage system according to an embodiment of the present invention, and as shown in FIG. 5, after generating an offline task for a target disk in the distributed storage system when it is detected that the target disk has a failure, the method further includes:

step S502, determining a backup strategy according to the data influence range after the target disk is offline;

step S504, the data information of the target disk is backed up to the disks in the distributed storage system except for the target disk by using the backup policy.

In the embodiment of the present application, the data influence range is calculated based on the redundancy of the data information.

Optionally, the data influence range is a range of reading and writing of data information influenced by the offline of the failed disk, and may be obtained by calculation according to redundancy of the data information included in the target disk.

In the embodiment of the present application, if the redundancy level of the data information is higher, the corresponding data influence range is smaller; if the redundancy level of the data information is lower, the corresponding data influence range is larger.

In an optional embodiment, if the influence range is larger, the redundancy level of the data information is lower and the data information of the target disk needs to be backed up, so a backup policy that backs up the data information of the target disk may be determined. In another optional embodiment, if the influence range is smaller, the redundancy level of the data information is higher and the data information of the target disk does not need to be backed up, so a backup policy under which the data information of the target disk is not backed up may be determined.
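The inverse relation between redundancy and influence range, and the resulting policy choice, can be sketched as follows. The definition of the influence range as the fraction of chunks with only a single copy, the policy labels, and the default threshold are assumptions for illustration; the source gives no formula.

```python
from typing import List


def influence_range(chunk_replica_counts: List[int]) -> float:
    """Fraction of the target disk's chunks that would become unreadable
    if the disk went offline, i.e. chunks whose only copy is on that disk.
    Each count includes the copy on the target disk itself."""
    if not chunk_replica_counts:
        return 0.0
    at_risk = sum(1 for n in chunk_replica_counts if n <= 1)
    return at_risk / len(chunk_replica_counts)


def choose_backup_policy(influence: float, threshold: float = 0.0) -> str:
    """Back up before going offline when the influence range exceeds the
    (hypothetical) threshold; otherwise no backup is needed."""
    if influence > threshold:
        return "backup_before_offline"
    return "no_backup_needed"


# Two of four chunks have a single copy, so half the data would be affected.
policy = choose_backup_policy(influence_range([3, 3, 1, 1]))
```

Higher replica counts shrink the influence range toward zero, at which point the sketch selects the no-backup policy.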

In the above optional embodiment, since the target disk fails and needs to be offline processed, data information contained in the target disk may be backed up to another disk, where the another disk is a disk in the distributed storage system other than the target disk.

In an optional embodiment, after the data information of the target disk is backed up, the attribute state of the target disk is modified to an offline state; and performing offline processing on the target disk with the attribute state being the offline state through the disk processing program.

In the above optional embodiment, once the data information of the target disk has been backed up, the approval program may modify the attribute state of the target disk to the offline state, so that the target disk is not used by mistake even if a program is restarted; at the same time, the approval program modifies the state of the offline task of the target disk in the configuration management database to the offline state.

In an optional embodiment, after offline processing is performed, through the disk processing program, on the target disk whose attribute state is the offline state, the method further includes:

Step S602, removing the target disk in the offline state from the distributed storage system.

In an optional embodiment, when the disk processing program determines that the offline task of the target disk can be processed, it may remove the target disk whose attribute state is the offline state from the distributed storage system. Optionally, it may also change the state of the offline task of the target disk to a target state, for example a completed state, which indicates that the offline task is completed.

As an optional embodiment, after the target disk in the offline state is removed from the distributed storage system, the disk processing program also generates and initiates a work order, notifying a worker that the target disk can be pulled out at any time for repair.
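The removal step and the accompanying work order can be sketched as below. This is a simplified, in-memory illustration; the class names `Cluster` and `WorkOrder`, the state strings, and the `pull_and_repair` action label are assumptions introduced here and do not appear in the application.

```python
from dataclasses import dataclass, field


@dataclass
class WorkOrder:
    disk_id: str
    action: str = "pull_and_repair"  # hypothetical label for the repair request


@dataclass
class Cluster:
    disks: dict = field(default_factory=dict)        # disk_id -> attribute state
    task_states: dict = field(default_factory=dict)  # disk_id -> offline task state
    work_orders: list = field(default_factory=list)

    def remove_offline_disk(self, disk_id: str) -> bool:
        """Remove a disk only if its attribute state is already 'offline'."""
        if self.disks.get(disk_id) != "offline":
            return False
        del self.disks[disk_id]                     # remove from the system
        self.task_states[disk_id] = "completed"     # mark the offline task done
        self.work_orders.append(WorkOrder(disk_id)) # notify a worker
        return True
```

In this sketch a disk that has not yet been marked offline is left untouched, which mirrors the ordering in the embodiment: the attribute state is modified first, and removal follows.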

To facilitate understanding of the embodiments of the present application, the method for processing a failed disk in a distributed storage system provided in this application is explained below through an optional implementation. Fig. 6 is a flowchart of an optional method for processing a failed disk in a distributed storage system according to an embodiment of the present invention. As shown in fig. 6, the method may be implemented through, but is not limited to, the following method steps:

Step S60, determining a failure determination rule for detecting a disk failure.

It should be noted that the execution subject in the embodiment of the present application may be, but is not limited to, a distributed storage system, and is configured to process a failed disk in the distributed storage system, so as to achieve the purpose of ensuring data security when processing a disk failure in the distributed storage system.

In an alternative embodiment, the rule is a failure determination rule predefined by the distributed storage system for detecting a disk failure, and the failure determination rule includes at least one of: the state of reading and writing data in the disk and the working state of each disk scanned by the disk detection program.

Step S61, scanning the operating status of each disk, and determining whether there is a failure in each disk according to the failure determination rule.

For example, the state of data reading and writing on a disk is checked to determine whether data can be read and written successfully or whether a read failure or a write failure occurs; and the working state of each disk, as scanned by a disk detection program running in the distributed storage system, is evaluated against the failure determination rule to determine whether each disk has a fault.
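A failure determination rule of the kind used in steps S60 and S61 can be sketched as follows. The rule shown here (a disk is faulty if a read fails, a write fails, or the scanned working state is not normal) is a hypothetical example; the application only states that the rule involves the read/write state and the scanned working state. The names `DiskStatus`, `is_faulty`, and `scan_for_faults` are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class DiskStatus:
    disk_id: str
    read_ok: bool
    write_ok: bool
    scan_state: str  # e.g. "normal" or "abnormal", as reported by the scanner


def is_faulty(status: DiskStatus) -> bool:
    """Hypothetical rule: any read/write failure or an abnormal scan state."""
    return (not status.read_ok) or (not status.write_ok) \
        or status.scan_state != "normal"


def scan_for_faults(statuses: list[DiskStatus]) -> list[str]:
    """Return the ids of all disks that the rule classifies as faulty."""
    return [s.disk_id for s in statuses if is_faulty(s)]
```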

Step S62, reporting the fault event of the target disk to a configuration management database when it is detected that the target disk in the distributed storage system has a fault.

Step S63, generating the offline task of the target disk according to the event and issuing the offline task.

In the above optional embodiment, a disk detection program running in the distributed storage system is used to detect whether the target disk has a fault. For example, the disk detection program may, without limitation, continuously scan the state of each disk and, on detecting that the state of the target disk is abnormal, report the fault event of the target disk to the configuration management database so that the configuration management database records the event.
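Steps S62 and S63 can be sketched with an in-memory stand-in for the configuration management database. The class `ConfigManagementDB`, its method names, and the record fields (`reason`, `task_id`, `state`) are assumptions made for illustration; the application does not define the database schema or interface.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ConfigManagementDB:
    events: list = field(default_factory=list)
    tasks: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def report_fault(self, disk_id: str, reason: str) -> dict:
        """S62: record the fault event of the target disk."""
        event = {"disk_id": disk_id, "reason": reason}
        self.events.append(event)
        return event

    def issue_offline_task(self, event: dict) -> dict:
        """S63: generate an offline task from the event and issue it."""
        task = {"task_id": next(self._ids),
                "disk_id": event["disk_id"],
                "state": "pending"}
        self.tasks.append(task)
        return task
```

A detection program would call `report_fault` when it finds an abnormal disk and then `issue_offline_task` with the returned event, so that every offline task can be traced back to a recorded fault.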

Step S64, determining the processing priority of the offline task.

In the embodiment of the present application, an approval program running in the distributed storage system may be used, without limitation, to obtain the tasks currently to be processed from the configuration management database and to determine the processing priority of each task.
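One possible way to order the pending tasks is sketched below with a binary heap. The ordering used here (tasks whose data is already backed up first, then tasks with a smaller data influence range, then arrival order) is a hypothetical choice made only to show the mechanism; the application names the inputs to the priority decision but does not fix how they are combined. The field names `backed_up`, `influence_range`, and `task_id` are also assumptions.

```python
import heapq


def priority_key(task: dict) -> tuple:
    """Smaller tuples are processed first (hypothetical ordering).

    Tasks whose data is already backed up come first, then tasks with a
    smaller influence range; task_id breaks ties in arrival order.
    """
    return (0 if task["backed_up"] else 1,
            task["influence_range"],
            task["task_id"])


def order_tasks(pending: list[dict]) -> list[int]:
    """Return task ids in the order in which they would be processed."""
    heap = [(priority_key(t), t["task_id"]) for t in pending]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```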

Step S65, when it is the turn to process the offline task, determining the redundancy level of the target disk.

Step S66, if the redundancy level of the target disk meets the offline condition, the target disk is offline processed by the disk processing program.

Step S67, removing the target disk in the offline state from the distributed storage system.
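Steps S64 to S67 can be combined into the following sketch. It assumes that each task carries a numeric `priority` (lower values are processed first) and that the redundancy check is supplied as a callable; tasks whose disks do not meet the offline condition are simply collected in a `deferred` list, which is an assumption, since the application does not state what happens to such tasks.

```python
def run_offline_pipeline(tasks, has_enough_redundancy):
    """Process offline tasks in priority order (steps S64 to S67).

    tasks: list of dicts with hypothetical keys "disk_id" and "priority"
        (lower value is processed first).
    has_enough_redundancy: callable(disk_id) -> bool standing in for the
        redundancy-level check of step S65.
    Returns (removed, deferred) lists of disk ids.
    """
    removed, deferred = [], []
    for task in sorted(tasks, key=lambda t: t["priority"]):  # S64
        if has_enough_redundancy(task["disk_id"]):           # S65
            task["state"] = "offline"                        # S66
            removed.append(task["disk_id"])                  # S67
            task["state"] = "completed"
        else:
            deferred.append(task["disk_id"])  # offline condition not met
    return removed, deferred
```

The redundancy check is evaluated only when a task's turn arrives, not when the task is created, which matches the wording of step S65.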

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

According to an embodiment of the present invention, an apparatus embodiment for implementing the method for processing a failed disk in a distributed storage system is further provided, and fig. 7 is a schematic diagram of an apparatus for processing a failed disk in a distributed storage system according to an embodiment of the present invention, as shown in fig. 7, the apparatus 700 includes: a generation module 702, a first determination module 704, a second determination module 706, and a processing module 708, wherein:

a generating module 702, configured to generate an offline task for a target disk in a distributed storage system when a failure of the target disk is detected; a first determining module 704, configured to determine a processing priority of the offline task; a second determining module 706, configured to determine a redundancy level of the target disk when it is time to process the offline task; the processing module 708 is configured to perform offline processing on the target disk through a disk processing program if the redundancy level of the target disk meets the offline condition.

It should be noted here that the generating module 702, the first determining module 704, the second determining module 706, and the processing module 708 correspond to steps S202 to S208 in embodiment 1, and the four modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the above modules may be operated in the computer terminal 10 provided in embodiment 1 as a part of the apparatus.

In addition, it should be noted that, for alternative or preferred embodiments of the present embodiment, reference may be made to the relevant description in embodiment 1, and details are not described herein again.

Example 3

According to an embodiment of the present invention, there is further provided an embodiment of a computer system, where the computer system is capable of executing any optional or preferred method for processing a failed disk in a distributed storage system in embodiment 1, and the computer system includes a processor and a memory, wherein:

a processor; and a memory, connected to the processor, for providing the processor with instructions for performing the following processing steps: under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

Example 4

The embodiment of the invention can provide a computer terminal which can be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.

Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.

In this embodiment, the computer terminal may execute program code for the following steps of the method for processing a failed disk in a distributed storage system: under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

Optionally, fig. 8 is a block diagram of a computer terminal according to an embodiment of the present invention, and as shown in fig. 8, the computer terminal 800 may include: one or more processors 802 (only one of which is shown), a memory 804, and a peripheral interface 806.

The memory may be configured to store a software program and a module, such as program instructions/modules corresponding to the method and apparatus for processing a failed disk in the distributed storage system in the embodiment of the present invention, and the processor executes various functional applications and data processing by running the software program and the module stored in the memory, that is, implements the method for processing a failed disk in the distributed storage system. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memories may further include a memory located remotely from the processor, which may be connected to the computer terminal 800 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

Optionally, the processor may further execute the program code of the following steps: and scanning the working state of each disk, and determining whether each disk has a fault according to a fault judgment rule.

Optionally, the processor may further execute the program code of the following steps: reporting the fault event of the target disk to a configuration management database; and generating an offline task of the target disk according to the event, and issuing the offline task.

Optionally, the processor may further execute the program code of the following steps: determining a backup strategy according to the data influence range after the target disk is offline; and backing up the data information of the target disk to disks except the target disk in the distributed storage system by adopting the backup strategy.

Optionally, the processor may further execute the program code of the following steps: determining a task to be processed in a task operation and maintenance operation room under the distributed storage system; determining the data influence range after the target disk is offline; determining the backup state of the target disk; and determining the processing priority of the offline task based on the task to be processed, the data influence range of the target disk and the backup state of the target disk.

Optionally, the processor may further execute the program code of the following steps: determining that the redundancy level of the target disk meets an offline condition under the condition that data reading and writing are not influenced after the target disk is offline; and modifying the attribute state of the target disk into an offline state, and performing offline processing on the target disk with the attribute state being the offline state through a disk processing program.

Optionally, the processor may further execute the program code of the following steps: if the fault monitoring module in the distributed storage system detects that the target disk has a fault, backing up the data information of the target disk to disks other than the target disk; or, if the probability that the target disk cannot work normally is greater than a preset probability, backing up the data information of the target disk to disks in the distributed storage system other than the target disk.

Optionally, the processor may further execute the program code of the following steps: and removing the target disk in the offline state from the distributed storage system.

The embodiment of the invention provides a scheme for processing a fault disk in a distributed storage system. An offline task for a target disk in the distributed storage system is generated when the target disk is detected to have a fault; the processing priority of the offline task is determined; the redundancy level of the target disk is determined when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, offline processing is performed on the target disk through a disk processing program. Data security is thereby ensured when disk faults in the distributed storage system are processed, which solves the technical problem in the prior art that data security is difficult to ensure when such disk faults are processed.

It can be understood by those skilled in the art that the structure shown in fig. 8 is only an illustration, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 8 does not limit the structure of the electronic device. For example, the computer terminal 800 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 8, or have a different configuration than shown in fig. 8.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program that instructs the hardware associated with a terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only Memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

Example 5

The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store the program code for executing the method for processing a failed disk in a distributed storage system provided in embodiment 1.

Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk; determining the processing priority of the offline task; determining the redundancy level of the target disk when the offline task is processed; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: and scanning the working state of each disk, and determining whether each disk has a fault according to a fault judgment rule.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: reporting the fault event of the target disk to a configuration management database; and generating an offline task of the target disk according to the event, and issuing the offline task.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: determining a backup strategy according to the data influence range after the target disk is offline; and backing up the data information of the target disk to disks except the target disk in the distributed storage system by adopting the backup strategy.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: determining a task to be processed in a task operation and maintenance operation room under the distributed storage system; determining the data influence range after the target disk is offline; determining the backup state of the target disk; and determining the processing priority of the offline task based on the task to be processed, the data influence range of the target disk and the backup state of the target disk.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: determining that the redundancy level of the target disk meets an offline condition under the condition that data reading and writing are not influenced after the target disk is offline; and modifying the attribute state of the target disk into an offline state, and performing offline processing on the target disk with the attribute state being the offline state through a disk processing program.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: if the fault monitoring module in the distributed storage system detects that the target disk has a fault, backing up the data information of the target disk to disks other than the target disk; or, if the probability that the target disk cannot work normally is greater than a preset probability, backing up the data information of the target disk to disks in the distributed storage system other than the target disk.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: and removing the target disk in the offline state from the distributed storage system.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program code.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for processing a failed disk in a distributed storage system comprises the following steps:

under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk;

determining the processing priority of the offline task;

determining a redundancy level of the target disk when it is time to process the offline task based on the processing priority;

and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

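2. The processing method according to claim 1, wherein before detecting that a target disk in the distributed storage system has a fault, the method further comprises:

scanning the working state of each disk, and determining whether each disk has a fault according to a fault judgment rule.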

3. The processing method according to claim 1, wherein in the case that a failure of a target disk in the distributed storage system is detected, generating an offline task for the target disk comprises:

reporting the fault event of the target disk to a configuration management database;

and generating an offline task of the target disk according to the event, and issuing the offline task.

4. The processing method according to claim 1, wherein after generating an offline task for a target disk in the distributed storage system in a case where a failure of the target disk is detected, the method further comprises:

determining a backup strategy according to the data influence range after the target disk is offline;

and backing up the data information of the target disk to disks except the target disk in the distributed storage system by adopting the backup strategy.

5. The processing method according to claim 1, wherein determining the processing priority of the offline task comprises:

determining a task to be processed in a task operation and maintenance operation room under the distributed storage system;

determining the data influence range after the target disk is offline;

determining the backup state of the target disk;

and determining the processing priority of the offline task based on the task to be processed, the data influence range of the target disk and the backup state of the target disk.

6. The processing method according to claim 1, wherein, if the redundancy level of the target disk meets an offline condition, performing offline processing on the target disk by using a disk processing program comprises:

determining that the redundancy level of the target disk meets an offline condition under the condition that data reading and writing are not affected after the target disk is offline;

and modifying the attribute state of the target disk into an offline state, and performing offline processing on the target disk with the attribute state being the offline state through a disk processing program.

7. The processing method according to claim 6, wherein the condition that reading and writing of data is not affected after the target disk is offline includes:

if the fault monitoring module in the distributed storage system monitors that the target disk has a fault, the data information of the target disk is backed up by adopting disks except the target disk,

or, under the condition that the probability that the target disk cannot normally work is greater than a preset probability, backing up the data information of the target disk to disks in the distributed storage system except the target disk.

8. The processing method according to claim 6, wherein after performing offline processing, through the disk processing program, on the target disk whose attribute state is the offline state, the method further comprises:

and removing the target disk in the offline state from the distributed storage system.

9. A device for handling a failed disk in a distributed storage system, comprising:

the generating module is used for generating an offline task aiming at a target disk in the distributed storage system under the condition that the target disk is detected to have a fault;

the first determining module is used for determining the processing priority of the offline task;

a second determining module for determining a redundancy level of the target disk when it is a turn to process the offline task based on the processing priority;

and the processing module is used for performing offline processing on the target disk through a disk processing program if the redundancy level of the target disk meets the offline condition.

10. A storage medium comprising a stored program, wherein the program, when executed, controls an apparatus in which the storage medium is located to perform the steps of: under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk; determining the processing priority of the offline task; determining a redundancy level of the target disk when it is time to process the offline task based on the processing priority; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.

11. A computer system, comprising:

a processor; and

a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: under the condition that a target disk in a distributed storage system is detected to have a fault, generating an offline task aiming at the target disk; determining the processing priority of the offline task; determining a redundancy level of the target disk when it is time to process the offline task based on the processing priority; and if the redundancy level of the target disk meets the offline condition, performing offline processing on the target disk through a disk processing program.