CN114546285A

CN114546285A - Storage system fault maintenance method, device, equipment and storage medium

Info

Publication number: CN114546285A
Application number: CN202210181686.3A
Authority: CN
Inventors: 孙凤超
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-02-26
Filing date: 2022-02-26
Publication date: 2022-05-27

Abstract

The present invention relates to the field of storage, and in particular to a storage system fault maintenance method, apparatus, device, and storage medium. The method comprises the following steps: reading a register of the storage system to acquire register information identifying the health state of each component of the storage system; analyzing the register information to determine fault information; judging a fault type based on the fault information, and determining the hard disk to be maintained according to the fault type; and initiating and executing a maintenance task based on the determined hard disk to add a move-out storage system identification to that hard disk. The scheme monitors the storage system in real time and automatically moves problematic hard disks out of the storage system, which avoids manual analysis and manual removal of failed hard disks, improves the maintenance efficiency of the storage system, and greatly reduces manual maintenance cost.

Description

Storage system fault maintenance method, device, equipment and storage medium

Technical Field

The present invention relates to the field of storage, and in particular to a storage system fault maintenance method, apparatus, device, and storage medium.

Background

With the explosive growth of massive unstructured data, distributed storage has become a cornerstone on which users build data center architectures, and more and more key services are connected to distributed storage. As distributed storage is deployed more widely, the operation and maintenance pressure on storage systems keeps increasing. When core components such as a disk, a central processing unit, or a memory fail, how to quickly locate and replace the failed hardware without interrupting cluster services has become a core problem in the operation and maintenance of mass storage.

At present, the mainstream approach is that, after the system detects a hardware fault, it generates an alarm log to remind the user that a replacement or repair operation is needed. The user then manually kicks the failed node or hard disk out of the system, which prevents services from continuing to write data to the failed node or disk and ensures service continuity. Because mass storage clusters are usually large in scale, quickly finding the problem node or disk is a pain point and a difficulty for operation and maintenance personnel.

Disclosure of Invention

In view of the above, it is necessary to provide a storage system fault maintenance method, apparatus, device and storage medium that solve the problem that existing storage systems depend solely on manual maintenance.

According to a first aspect of the present invention, there is provided a storage system fault maintenance method, the method comprising:

reading a register of the storage system to obtain register information for identifying the health state of each component of the storage system;

analyzing the register information to determine fault information;

judging a fault type based on the fault information, and determining a hard disk to be maintained according to the fault type;

initiating and executing a maintenance task based on the determined hard disk needing maintenance to add a move-out storage system identification to the hard disk needing maintenance.

In some embodiments, the failure information includes storage node failure information and hard disk failure information;

the step of judging the fault type based on the fault information and determining the hard disk to be maintained according to the fault type comprises the following steps:

determining the fault type to be a disk fault in response to the fault information being hard disk fault information;

determining the fault type to be a node fault in response to the fault information being storage node fault information;

in response to the fault type being a disk fault, taking the hard disk corresponding to the fault information as the hard disk to be maintained;

and in response to the fault type being the node fault, taking all the hard disks included in the fault node as hard disks needing to be maintained.
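The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fault` record, its field names, and the `topology` mapping from node ids to disk ids are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault record; the field names are illustrative only."""
    source: str         # "disk" or "node": the kind of fault information
    node_id: str        # storage node on which the fault was reported
    disk_id: str = ""   # set only for hard disk fault information


def fault_type(fault: Fault) -> str:
    """Map the kind of fault information to a fault type."""
    if fault.source == "disk":
        return "disk_fault"
    if fault.source == "node":
        return "node_fault"
    raise ValueError(f"unknown fault source: {fault.source!r}")


def disks_to_maintain(fault: Fault, topology: Dict[str, List[str]]) -> List[str]:
    """Return the hard disks that need maintenance for this fault.

    topology maps each node id to the ids of the disks it contains.
    """
    if fault_type(fault) == "disk_fault":
        return [fault.disk_id]
    # Node fault: every hard disk included in the failed node is selected.
    return list(topology[fault.node_id])


topology = {"node-1": ["sda", "sdb"], "node-2": ["sdc"]}
print(disks_to_maintain(Fault("disk", "node-1", "sdb"), topology))  # ['sdb']
print(disks_to_maintain(Fault("node", "node-1"), topology))         # ['sda', 'sdb']
```

Keeping the classification (`fault_type`) separate from the selection (`disks_to_maintain`) mirrors the two-stage wording of the step: the type is judged first, and the disks follow from the type.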

In some embodiments, the method further comprises:

detecting whether a hard disk to be maintained can work normally;

and in response to confirming that the failed node and/or the failed hard disk can work normally, initiating and executing a close-maintenance task to delete the move-out storage system identification of the hard disk that has returned to normal.

In some embodiments, the method further comprises:

and sending a lighting command to light a fault lamp of the hard disk needing maintenance in response to the completion of the execution of the maintenance task.

In some embodiments, the method further comprises:

and sending a light-off command to turn off the fault lamp of the hard disk that has returned to normal in response to the completion of the execution of the maintenance task.
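The interplay between the move-out identification, the recovery check, and the fault lamp can be illustrated with a small in-memory model. The sketch below is not the patented implementation: `MaintenanceRegistry`, its method names, and the `is_healthy` probe are hypothetical, and it assumes the light-off command is issued when maintenance is closed for a recovered disk.

```python
from typing import Callable, Dict, Set


class MaintenanceRegistry:
    """In-memory stand-in for the move-out identifications and fault lamps."""

    def __init__(self) -> None:
        self.marked: Set[str] = set()        # disks carrying the move-out identification
        self.lamp_on: Dict[str, bool] = {}   # fault lamp state per disk

    def run_maintenance_task(self, disk_id: str) -> None:
        self.marked.add(disk_id)             # add the move-out identification
        self.lamp_on[disk_id] = True         # lighting command once the task has completed

    def close_if_recovered(self, is_healthy: Callable[[str], bool]) -> Set[str]:
        """Close maintenance for every marked disk that works normally again."""
        recovered = {d for d in self.marked if is_healthy(d)}
        for disk_id in recovered:
            self.marked.discard(disk_id)     # delete the move-out identification
            self.lamp_on[disk_id] = False    # light-off command
        return recovered


registry = MaintenanceRegistry()
registry.run_maintenance_task("sda")
registry.run_maintenance_task("sdb")
# Assume only sda has been repaired when the check runs.
recovered = registry.close_if_recovered(lambda disk_id: disk_id == "sda")
print(sorted(recovered), sorted(registry.marked))  # ['sda'] ['sdb']
```

In a real system the probe would re-read the relevant registers; here it is a plain callable so that the bookkeeping can be followed in isolation.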

In some embodiments, the method further comprises:

in response to determining the fault information, generating alarm information based on the fault information and displaying the alarm information through a user interface (UI).

In some embodiments, the method further comprises:

configuring automatic alarm information in advance, wherein the automatic alarm information comprises a mailbox address and/or a telephone number;

in response to the alarm information being generated, judging whether automatic alarm is enabled;

and in response to confirming that the user has enabled automatic alarm, sending the alarm information to the pre-configured mailbox address by mail and/or sending the alarm information to the pre-configured telephone number by short message.
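The automatic alarm decision can be sketched as follows. The names `AutoAlarmConfig` and `dispatch_alarm` are hypothetical, and the two sender callables stand in for whatever mail and short message gateways a deployment actually uses; none of these appear in the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AutoAlarmConfig:
    """Pre-configured automatic alarm information (field names are illustrative)."""
    enabled: bool = False
    mailbox: Optional[str] = None
    phone: Optional[str] = None


Sender = Callable[[str, str], None]  # (address, message) -> None


def dispatch_alarm(message: str, config: AutoAlarmConfig,
                   send_mail: Sender, send_sms: Sender) -> List[str]:
    """Send the alarm over every configured channel and return the channels used."""
    used: List[str] = []
    if not config.enabled:        # the user has not turned automatic alarm on
        return used
    if config.mailbox:
        send_mail(config.mailbox, message)
        used.append("mail")
    if config.phone:
        send_sms(config.phone, message)
        used.append("sms")
    return used


# Usage with recording stubs instead of a real mail or SMS gateway.
outbox: List[tuple] = []
config = AutoAlarmConfig(enabled=True, mailbox="ops@example.com")
channels = dispatch_alarm(
    "disk sdb: read/write rate too low",
    config,
    send_mail=lambda addr, msg: outbox.append(("mail", addr, msg)),
    send_sms=lambda addr, msg: outbox.append(("sms", addr, msg)),
)
print(channels)  # ['mail']
```

Passing the senders in as parameters keeps the decision logic (enabled or not, which addresses are configured) testable without any network access.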

According to a second aspect of the present invention, there is provided a storage system failure maintenance apparatus, the apparatus comprising:

the reading module is configured for reading a register of the storage system to acquire register information for identifying the health state of each component of the storage system;

the fault analysis module is configured to analyze the register information to determine fault information;

the determining module is configured to judge a fault type based on the fault information and determine a hard disk to be maintained according to the fault type;

and the identification module is configured to initiate and execute a maintenance task based on the determined hard disk needing maintenance so as to add a move-out storage system identification to the hard disk needing maintenance.

According to a third aspect of the present invention, there is also provided a computer apparatus comprising:

at least one processor; and

a memory storing a computer program operable on a processor, the processor executing the program to perform the aforementioned storage system fault maintenance method, the method comprising:

analyzing the register information to determine fault information;

According to a fourth aspect of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, performs the aforementioned storage system fault maintenance method, the method comprising:

analyzing the register information to determine fault information;

According to the above storage system fault maintenance method, register information identifying the health state of each component of the storage system is obtained by reading the register of the storage system, and fault information is derived from it. The fault information is then used to judge the fault type and determine the hard disk to be maintained. Finally, a maintenance task is initiated and executed based on the determined hard disk to add a move-out storage system identification to it. The storage system is thus monitored in real time and problematic hard disks are automatically moved out of it, which avoids manual analysis and manual removal of failed hard disks, improves the maintenance efficiency of the storage system, and greatly reduces manual maintenance cost.

In addition, the storage system fault maintenance device, the computer device and the computer readable storage medium provided by the invention can also achieve the technical effects, and are not described herein again.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other embodiments from these drawings without creative effort.

Fig. 1 is a schematic flow chart of a storage system fault maintenance method 100 according to an embodiment of the present invention;

Fig. 2 is a logic diagram for automatically marking and locating a failed hard disk in a storage system according to yet another embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a storage system fault maintenance apparatus 200 according to another embodiment of the present invention;

Fig. 4 is an internal structural view of a computer device in another embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used to distinguish two different entities or parameters that share the same name. They are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

In one embodiment, referring to fig. 1, the present invention provides a storage system fault maintenance method 100, which includes the following steps:

Step 101, reading a register of the storage system to obtain register information for identifying the health state of each component of the storage system;

In this embodiment, the storage system includes a plurality of storage nodes, and each storage node includes at least one hard disk. The registers may include registers that record node information and registers that record hard disk information; for example, one register records data related to the read/write rate of a hard disk, and another records data related to the central processing unit of a storage node.

Step 102, analyzing the register information to determine fault information;

In this embodiment, the fault information may include any existing hard disk fault information or fault information of a node device, for example, the read/write rate of a certain hard disk being too low or the temperature of a certain central processing unit being too high.

Step 103, judging a fault type based on the fault information, and determining a hard disk to be maintained according to the fault type;

Step 104, initiating and executing a maintenance task based on the determined hard disk needing maintenance to add a move-out storage system identifier to the hard disk needing maintenance.

In this embodiment, the move-out storage system identifier is denoted noout. It indicates that the hard disk may still be running but no longer accepts data requests; at this point the disk no longer belongs to the CRUSH Map (Controlled Replication Under Scalable Hashing Map, the tree-structured data distribution map of the distributed storage cluster, which has at most 10 levels and tells the CRUSH algorithm how data should be distributed). The move-out storage system identifier can therefore be used to prevent the hard disk from accepting data requests.
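As an illustration of steps 101 and 102, the sketch below turns a snapshot of register values into fault information using simple threshold checks. The source gives no numeric limits and no register layout, so the thresholds, the key names (`rw_mbps`, `cpu_temp_c`), and the `FaultInfo` record are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple


class FaultInfo(NamedTuple):
    component: str   # e.g. "disk:sda" or "node:node-1" (illustrative naming)
    reason: str


# Assumed thresholds; the source states no numeric limits.
MIN_DISK_RW_MBPS = 50.0
MAX_CPU_TEMP_C = 90.0


def analyze_registers(registers: Dict[str, Dict[str, float]]) -> List[FaultInfo]:
    """Derive fault information from per-component register values."""
    faults: List[FaultInfo] = []
    for component, values in registers.items():
        rw = values.get("rw_mbps")
        if rw is not None and rw < MIN_DISK_RW_MBPS:
            faults.append(FaultInfo(component, "read/write rate too low"))
        temp = values.get("cpu_temp_c")
        if temp is not None and temp > MAX_CPU_TEMP_C:
            faults.append(FaultInfo(component, "CPU temperature too high"))
    return faults


# A register snapshot as step 101 might deliver it (values are made up).
snapshot = {
    "disk:sda": {"rw_mbps": 180.0},
    "disk:sdb": {"rw_mbps": 12.5},
    "node:node-1": {"cpu_temp_c": 97.0},
}
for fault in analyze_registers(snapshot):
    print(fault.component, "-", fault.reason)
```

The two checks correspond to the two examples named in the text (a hard disk whose read/write rate is too low and a central processing unit whose temperature is too high); further checks would be added in the same way.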

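In step 103, judging the fault type based on the fault information and determining the hard disk to be maintained according to the fault type are performed as set out in the first aspect above: a disk fault yields the hard disk corresponding to the fault information, and a node fault yields all hard disks included in the failed node.

In some embodiments, the method further comprises detecting whether a hard disk to be maintained can work normally, together with the other optional steps set out in the first aspect above (closing maintenance once the disk works normally again, the lighting and light-off commands, display of alarm information through the UI, and automatic alarm), which are not described herein again.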

In some embodiments, to facilitate understanding of the technical solution of the present invention, the maintenance processes for a disk failure and for a node failure are described separately below with reference to fig. 2. The specific implementation is as follows:

Case 1: disk failure

1) reading the register that records disk hardware information, and reporting the register information;

2) processing the register information to generate fault information, and reporting the fault information so that an alarm can be raised;

3) raising the alarm as follows: alarm information is generated from the fault information and passed to the display interface so that it is shown on the UI; if the user has configured mail alarm or short message alarm, a mail or short message is also sent to the designated mailbox or mobile phone number when the data is displayed;

4) triggering a maintenance task if the fault is confirmed to be a disk fault based on the alarm information;

5) when the maintenance task is executed, marking the OSD carried by the failed disk as noout and starting maintenance mode, in which the OSD is isolated from the service and waits for disk replacement;

6) the maintenance task reports its execution status in real time, and the status is displayed to the user through the UI so that operation and maintenance personnel can follow the task state;

7) issuing a lighting command to light the fault lamp of the failed disk, so that operation and maintenance personnel can quickly locate it;

8) after the disk is replaced, closing maintenance mode for the corresponding OSD and removing the noout mark, so that the OSD resumes normal operation;

9) issuing a light-off command, which ends the maintenance mode.
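The real-time status reporting in the disk failure case can be illustrated with Python generators: each sub-step yields a status line that a UI could display as soon as it is produced. This is a sketch under assumptions, not the patented implementation; the function names, the `flags` and `lamps` dictionaries, and the status strings are all hypothetical.

```python
from typing import Dict, Iterator, Set


def disk_maintenance_task(osd_id: str, flags: Dict[str, Set[str]],
                          lamps: Dict[str, bool]) -> Iterator[str]:
    """Open maintenance for one OSD, yielding a status line after each sub-step."""
    flags.setdefault(osd_id, set()).add("noout")   # mark the OSD as noout
    yield f"{osd_id}: marked noout, maintenance mode started"
    lamps[osd_id] = True                           # lighting command
    yield f"{osd_id}: fault lamp on, waiting for disk replacement"


def close_maintenance_task(osd_id: str, flags: Dict[str, Set[str]],
                           lamps: Dict[str, bool]) -> Iterator[str]:
    """Close maintenance after the disk has been replaced."""
    flags.get(osd_id, set()).discard("noout")      # remove the noout mark
    yield f"{osd_id}: noout removed, back in service"
    lamps[osd_id] = False                          # light-off command
    yield f"{osd_id}: fault lamp off, maintenance mode ended"


flags: Dict[str, Set[str]] = {}
lamps: Dict[str, bool] = {}
# Each sub-step runs only when the consumer asks for the next status line.
for status in disk_maintenance_task("osd.3", flags, lamps):
    print(status)
for status in close_maintenance_task("osd.3", flags, lamps):
    print(status)
```

Because a generator runs lazily, the consumer sees each status line immediately after the corresponding sub-step has finished, which is one simple way to model a task that reports its execution status while it runs.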

Case 2: node failure

1) reading the register that records node hardware information, and reporting the register information;

2) processing the register information to generate fault information, and reporting the fault information so that an alarm can be raised;

3) raising the alarm as follows: alarm information is generated from the fault information and passed to the display interface so that it is shown on the UI; if the user has configured mail alarm or short message alarm, the data display module also sends a mail or short message to the designated mailbox or mobile phone number;

4) if the fault is confirmed to be a node fault based on the alarm information and the fault affects the whole service of the node (for example, a CPU or memory fault), triggering a maintenance task;

5) marking all OSDs of the failed node as noout and starting maintenance mode, in which the node is isolated from the service and waits for maintenance;

7) issuing a lighting command to light the fault lamps of all hard disks under the node, so that operation and maintenance personnel can quickly locate them;

8) after the node has been repaired (for example, the CPU has been replaced) and can work normally, closing maintenance mode for the node and removing the noout marks, so that the node resumes normal operation;

9) the task module issues a light-off command to turn off the fault lamps of all hard disks of the node, and the maintenance mode ends.
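What distinguishes the node failure case is the gating condition: maintenance is triggered only when the fault affects the whole service of the node, and it then covers every OSD on that node. A minimal sketch follows, assuming a hypothetical `node_osds` mapping and treating CPU and memory (the two examples the text gives) as the node-wide components.

```python
from typing import Dict, List, Set

# Assumption: only the two examples named in the text count as node-wide faults.
NODE_WIDE_COMPONENTS = {"cpu", "memory"}


def node_fault_needs_maintenance(component: str) -> bool:
    """Return True if a fault in this component affects the whole node."""
    return component.lower() in NODE_WIDE_COMPONENTS


def mark_node_osds(node_id: str, component: str,
                   node_osds: Dict[str, List[str]], noout: Set[str]) -> List[str]:
    """Mark every OSD of the node as noout if the fault is node-wide.

    Returns the OSDs that were marked (empty if no maintenance is triggered).
    """
    if not node_fault_needs_maintenance(component):
        return []
    osds = list(node_osds[node_id])
    noout.update(osds)
    return osds


node_osds = {"node-1": ["osd.0", "osd.1", "osd.2"]}
noout_marked: Set[str] = set()
print(mark_node_osds("node-1", "cpu", node_osds, noout_marked))
# ['osd.0', 'osd.1', 'osd.2']
```

The lighting command of step 7 would then be issued for the hard disks behind exactly the OSDs this function returns.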

The method monitors the system state in real time through interface operations. If a system hardware fault is found, operation and maintenance personnel are reminded promptly through an interface alarm; at the same time, the move-out storage system identifier is automatically added to the failed or unavailable disk and disk maintenance mode is started. In addition, so that operation and maintenance personnel can promptly find the failed node or disk in the machine room, the system also lights the fault lamps of the failed node and disk. After the operation and maintenance personnel finish maintaining the failed disk or node, the move-out storage system identifier is removed and maintenance mode is closed.

In another embodiment, referring to fig. 3, the present invention further provides a storage system failure maintenance apparatus 200, which includes:

the reading module 201 is configured to read a register of the storage system to obtain register information for identifying health states of components of the storage system;

a fault analysis module 202 configured to analyze the register information to determine fault information;

the determining module 203 is configured to determine a fault type based on the fault information, and determine a hard disk to be maintained according to the fault type;

an identification module 204 configured to initiate and perform a maintenance task based on the determined hard disk needing maintenance to add an out-of-storage-system identification to the hard disk needing maintenance.
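The division into four modules can be pictured as four pluggable callables wired together in order. The sketch below only illustrates that decomposition; the class name, the callable signatures, and the stub modules in the usage example are assumptions and not the apparatus as claimed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class FaultMaintenanceApparatus:
    """Four modules wired in sequence; each one is a replaceable callable."""
    reading_module: Callable[[], Dict[str, Any]]
    fault_analysis_module: Callable[[Dict[str, Any]], List[Dict[str, Any]]]
    determining_module: Callable[[Dict[str, Any]], List[str]]
    identification_module: Callable[[str], None]

    def run_once(self) -> List[str]:
        """Run one read, analyze, determine, identify pass; return the marked disks."""
        registers = self.reading_module()
        marked: List[str] = []
        for fault in self.fault_analysis_module(registers):
            for disk in self.determining_module(fault):
                self.identification_module(disk)
                marked.append(disk)
        return marked


# Usage with trivial stub modules (register contents are made up).
marked_disks: set = set()
apparatus = FaultMaintenanceApparatus(
    reading_module=lambda: {"sda": "ok", "sdb": "failed"},
    fault_analysis_module=lambda regs: [{"disk": d} for d, s in regs.items() if s != "ok"],
    determining_module=lambda fault: [fault["disk"]],
    identification_module=marked_disks.add,
)
print(apparatus.run_once())  # ['sdb']
```

Since each module is injected, a software module can later be swapped for one backed by hardware without changing how the four are sequenced.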

The above storage system fault maintenance apparatus obtains register information identifying the health state of each component of the storage system by reading the register of the storage system, and derives fault information from it. It then uses the fault information to judge the fault type and determine the hard disk to be maintained, and finally initiates and executes a maintenance task based on the determined hard disk to add a move-out storage system identification to it. The storage system is thus monitored in real time and problematic hard disks are automatically moved out of it, which avoids manual analysis and manual removal of failed hard disks, improves the maintenance efficiency of the storage system, and greatly reduces manual maintenance cost.

In some embodiments, the determining module is further configured to perform the fault type judgment and hard disk determination described above for step 103.

In some embodiments, the apparatus further comprises means for:

detecting whether a hard disk to be maintained can work normally;

and in response to confirming that the failed node and/or the failed hard disk can work normally, initiating and executing a close-maintenance task to delete the move-out storage system identifier of the hard disk that has returned to normal.

In some embodiments, the apparatus further comprises means for performing the other optional steps of the method described above, which are not described herein again.

It should be noted that, for specific limitations of the storage system fault maintenance apparatus, reference may be made to the above limitations of the storage system fault maintenance method, and details are not described herein again. The modules in the storage system fault maintenance apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

According to another aspect of the present invention, a computer device is provided, and the computer device may be a server, and its internal structure is shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements the storage system failure maintenance method described above, in particular the method comprising the steps of:

analyzing the register information to determine fault information;

In some embodiments, the method further comprises:

detecting whether a hard disk to be maintained can work normally.

The remaining optional steps are the same as those of the method embodiments described above and are not described herein again.

According to a further aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the storage system fault maintenance method described above, in particular comprising performing the steps of:

analyzing the register information to determine fault information;

In some embodiments, the method further comprises:

detecting whether a hard disk to be maintained can work normally.

The remaining optional steps are the same as those of the method embodiments described above and are not described herein again.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered within the scope of this specification as long as the combined features do not contradict one another.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A storage system failure maintenance method, the method comprising:

analyzing the register information to determine fault information;

2. The storage system fault maintenance method according to claim 1, wherein the fault information includes storage node fault information and hard disk fault information;

3. The storage system fault maintenance method of claim 2, wherein the method further comprises:

detecting whether a hard disk to be maintained can work normally;

4. The storage system fault maintenance method of claim 3, wherein the method further comprises:

5. The storage system fault maintenance method of claim 4, wherein the method further comprises:

6. The storage system fault maintenance method of claim 2, wherein the method further comprises;

7. The storage system fault maintenance method according to any one of claims 1 to 6, wherein the method further comprises:

8. A storage system failure maintenance apparatus, the apparatus comprising:

9. A computer device, comprising:

at least one processor; and

a memory storing a computer program operable in the processor, the processor when executing the program performing the method of any of claims 1-7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.