CN114218037A - Hard disk management method, device, equipment and machine readable storage medium

Info

Publication number: CN114218037A
Application number: CN202111406398.5A
Authority: CN
Inventors: 申瑞
Original assignee: New H3C Technologies Co Ltd Chengdu Branch
Current assignee: New H3C Technologies Co Ltd Chengdu Branch
Priority date: 2021-11-24
Filing date: 2021-11-24
Publication date: 2022-03-22

Abstract

The present disclosure provides a hard disk management method, apparatus, device, and machine-readable storage medium. The method comprises: obtaining the IO state of the monitored hard disks and filtering to identify the hard disks that have an IO timeout; acquiring hard disk health state information for the hard disks that have an IO timeout and, according to that information, identifying the hard disks that have both abnormal health state information and an IO timeout; and sending a power-off command, wherein the power-off command comprises an instruction to power off the hard disks that have both abnormal health state information and an IO timeout. With this technical solution, a failed hard disk is located quickly and accurately from its IO timeout state and abnormal health state information and is then powered off, which stops the abnormal hard disk from continuing to affect the normal hard disks coupled to it. Because the judgment also takes the health state information into account, a normal hard disk that merely suffers from coupling with the abnormal hard disk is not misjudged and powered off on the basis of IO timeout alone.

Description

Hard disk management method, device, equipment and machine readable storage medium

Technical Field

The present disclosure relates to the field of communications technologies, and in particular, to a hard disk management method, apparatus, device, and machine-readable storage medium.

Background

To further optimize storage performance, and for data security considerations, we typically combine hard disks using RAID techniques to achieve a better experience.

RAID may be divided into three types, "soft RAID", "semi-soft semi-hard RAID", and "hard RAID", depending on the implementation.

Hardware RAID is architecture independent and requires an additional RAID card. The RAID card has its own memory and CPU, provides complete data-processing capability, consumes almost no additional CPU or memory resources of the host system, and gives the system high storage performance and security.

In such a scenario, a large number of hard disks are managed together under the same RAID card, which inevitably introduces a degree of coupling between hard disks that should be independent of one another.

Under normal circumstances this coupling has no negative impact, but in some scenarios such seemingly insignificant coupling can cause serious problems for the storage system.

For example, when one hard disk under a RAID card becomes abnormal, the failure-handling mechanism of the RAID card, together with the coupling, may block the IO of all hard disks managed by that card. The blocking may last for a long time, until the abnormal hard disk recovers or the RAID card identifies it as a bad disk and removes it from its management scope.

Disclosure of Invention

In view of the above, the present disclosure provides a hard disk management method, an apparatus, an electronic device, and a machine-readable storage medium to solve the problem that normal hard disks remain blocked for a long time because of coupling.

The specific technical scheme is as follows:

The present disclosure provides a hard disk management method, applied to a hard disk management device, the method including: obtaining the IO state of the monitored hard disks and filtering to identify the hard disks that have an IO timeout; acquiring hard disk health state information for the hard disks that have an IO timeout and, according to that information, identifying the hard disks that have both abnormal health state information and an IO timeout; and sending a power-off command, wherein the power-off command includes an instruction to power off the hard disks that have both abnormal health state information and an IO timeout.

In one technical solution, sending the power-off command includes: sending a hard disk power-off command to the BMC, so that the BMC, according to the hard disk power-off command, controls the CPLD of the hard disk backplane through the I2C bus to power off the hard disk slot associated with the hard disk that has both abnormal health state information and an IO timeout.

In one technical solution, obtaining the IO state of the monitored hard disks and filtering to identify the hard disks that have an IO timeout includes: timing how long each IO takes to return after it is issued to a hard disk and, if the IO does not return within a preset duration, determining that the hard disk has an IO timeout.

In one technical solution, acquiring the hard disk health state information for the hard disks that have an IO timeout and identifying the hard disks that have both abnormal health state information and an IO timeout includes: monitoring the hard disk health state information of the hard disk by polling, wherein the hard disk health state information includes medium error, hardware error, not ready, and IO timeout.

The present disclosure also provides a hard disk management apparatus, applied to a hard disk management device, the apparatus including: an IO module, configured to obtain the IO state of the monitored hard disks and filter to identify the hard disks that have an IO timeout; a health module, configured to acquire hard disk health state information for the hard disks that have an IO timeout and, according to that information, identify the hard disks that have both abnormal health state information and an IO timeout; and a power-off module, configured to send a power-off command, wherein the power-off command includes an instruction to power off the hard disks that have both abnormal health state information and an IO timeout.

The present disclosure also provides an electronic device, which includes a processor and a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions capable of being executed by the processor, and the processor executes the machine-executable instructions to implement the foregoing hard disk management method.

The present disclosure also provides a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the aforementioned hard disk management method.

The technical scheme provided by the disclosure at least brings the following beneficial effects:

A failed hard disk is located quickly and accurately from its IO timeout state and abnormal hard disk health state information and is then powered off, which stops the abnormal hard disk from continuing to affect the normal hard disks coupled to it. Because the judgment also takes the hard disk health state information into account, a normal hard disk that merely suffers from coupling with the abnormal hard disk is not misjudged and powered off on the basis of IO timeout alone.

Drawings

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments of the present disclosure or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present disclosure, and other drawings can be obtained by those skilled in the art according to the drawings of the embodiments of the present disclosure.

FIG. 1 is a flow chart of a hard disk management method in one embodiment of the present disclosure;

FIG. 2 is a structural diagram of a hard disk management apparatus in one embodiment of the present disclosure;

FIG. 3 is a hardware configuration diagram of an electronic device in one embodiment of the present disclosure.

Detailed Description

The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information in the embodiments of the present disclosure, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. Moreover, depending on the context, the word "if" as used herein may be interpreted as "upon ...", "when ...", or "in response to a determination".

CPLD: Complex Programmable Logic Device;

IPMI: Intelligent Platform Management Interface;

BMC: Baseboard Management Controller;

RAID: Redundant Array of Independent Disks;

SMART: Self-Monitoring, Analysis and Reporting Technology;

OSD: Object-based Storage Device;

IOTH: IO Timeout Handler.

In one scheme, the IOTH module uses the SCSI trace mechanism built into the Linux kernel to monitor the IO state of all SCSI disks in the system.

When an IO has not returned after a specified threshold, the kernel-defined IO timeout handling flow is triggered, and within that flow the kernel calls the scsi_dispatch_cmd_timeout function to record detailed information about the timed-out IO.

Meanwhile, the user-mode IOTH module captures this information, which records the host, channel, id, and lun values of the logical disk; these values identify which logical disk has an IO timeout.
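As a minimal sketch of how a user-mode module might extract the host, channel, id, and lun values from a captured trace record, the following Python parses key=value fields from one line. The sample line and the field name host_no are assumptions about the trace output; the exact layout depends on the kernel version, and the function name parse_hcil is hypothetical.

```python
import re

# Hypothetical sample record; the real field layout depends on the kernel version.
SAMPLE = ("scsi_dispatch_cmd_timeout: host_no=0 channel=2 id=5 lun=0 "
          "data_sgl=1 prot_sgl=0 cmnd=(READ_10 lba=2048 txlen=8)")

# Match only the four addressing fields, each as a whole word followed by digits.
_FIELD = re.compile(r"\b(host_no|channel|id|lun)=(\d+)")


def parse_hcil(line):
    """Return (host, channel, id, lun) from a trace line, or None if a field is missing."""
    fields = {key: int(value) for key, value in _FIELD.findall(line)}
    try:
        return (fields["host_no"], fields["channel"], fields["id"], fields["lun"])
    except KeyError:
        return None
```

The resulting tuple can serve as a key for counting timeouts per logical disk.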

The IOTH module then sends this information to the corresponding OSD so that the OSD can be marked down in time, avoiding the serious service impact caused by a single OSD hanging for a long time.

However, this scheme does not consider the case where IO timeouts occur on several hard disks at the same time. When that happens, all of the timed-out hard disks are marked down, which impacts the service.

In addition, this scheme only monitors the IO latency of the hard disks and does not address the real cause of the IO timeout (that is, the abnormal hard disk is not found and isolated), so IO timeouts may recur repeatedly.

The specific technical scheme is as follows.

In one embodiment, the present disclosure provides a hard disk management method applied to a hard disk management device, the method including: obtaining the IO state of the monitored hard disks and filtering to identify the hard disks that have an IO timeout; acquiring hard disk health state information for the hard disks that have an IO timeout and, according to that information, identifying the hard disks that have both abnormal health state information and an IO timeout; and sending a power-off command, wherein the power-off command includes an instruction to power off the hard disks that have both abnormal health state information and an IO timeout.

Specifically, as shown in FIG. 1, the method comprises the following steps:

Step S11: obtaining the IO state of the monitored hard disks, and filtering to identify the hard disks that have an IO timeout;

Step S12: acquiring hard disk health state information for the hard disks that have an IO timeout and, according to that information, identifying the hard disks that have both abnormal health state information and an IO timeout;

Step S13: sending a power-off command, where the power-off command includes an instruction to power off the hard disks that have both abnormal health state information and an IO timeout.
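The three steps can be sketched as a small pipeline. This is an illustration only, not the disclosed implementation: the function names and the three callables (io_timed_out, health_abnormal, send_power_off) are hypothetical stand-ins for whatever mechanism performs each step.

```python
def find_disks_to_power_off(disks, io_timed_out, health_abnormal):
    """S11 and S12: keep disks that have an IO timeout AND abnormal health."""
    timed_out = [d for d in disks if io_timed_out(d)]        # S11: filter by IO timeout
    return [d for d in timed_out if health_abnormal(d)]      # S12: check health only for those


def manage(disks, io_timed_out, health_abnormal, send_power_off):
    """S13: send one power-off command per selected disk and return the selection."""
    targets = find_disks_to_power_off(disks, io_timed_out, health_abnormal)
    for disk in targets:
        send_power_off(disk)
    return targets
```

Health is queried only for disks that already timed out, so a healthy disk that is blocked merely because it shares a RAID card with a faulty one is never selected for power-off.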

In one embodiment, sending the power-off command includes: sending a hard disk power-off command to the BMC, so that the BMC, according to the hard disk power-off command, controls the CPLD of the hard disk backplane through the I2C bus to power off the hard disk slot associated with the hard disk that has both abnormal health state information and an IO timeout.

In an embodiment, obtaining the IO state of the monitored hard disks and filtering to identify the hard disks that have an IO timeout includes: timing how long each IO takes to return after it is issued to a hard disk and, if the IO does not return within a preset duration, determining that the hard disk has an IO timeout.

In an embodiment, acquiring the hard disk health state information for the hard disks that have an IO timeout and identifying the hard disks that have both abnormal health state information and an IO timeout includes: monitoring the hard disk health state information of the hard disk by polling, wherein the hard disk health state information includes medium error, hardware error, not ready, and IO timeout.

The IO_timeout of all disks is set to 5 seconds, so that any IO that has not returned more than 5 seconds after being issued to a disk is detected.

For example, if an IO has not returned more than 20 seconds after being issued, the kernel records a corresponding trace at each of four points in time: 5, 10, 15, and 20 seconds after issue. The same IO is therefore detected as having timed out 4 times, so the length of time the IO has been blocked can be determined from the number of times the same IO has timed out, with a precision equal to the IO_timeout setting.
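The arithmetic in this example can be written out as a short sketch. The function name blocked_seconds is hypothetical; the 5-second default is the timeout value given in the text.

```python
def blocked_seconds(timeout_count, io_timeout_s=5):
    """Lower bound, in seconds, on how long one IO has been blocked.

    Each recorded timeout of the same IO means another full timeout period
    has elapsed, so the true blocked time is at least count * period and
    less than (count + 1) * period.
    """
    if timeout_count < 0 or io_timeout_s <= 0:
        raise ValueError("count must be non-negative and the timeout positive")
    return timeout_count * io_timeout_s
```

With the values from the text, four recorded timeouts at a 5-second period give blocked_seconds(4) == 20.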

When an IO issued to a disk is detected as not having returned after more than 5 seconds, the disk can be marked as having an IO timeout. As an alternative, when the number of disks in the IO timeout state is greater than or equal to 2, the multi-disk IO timeout condition is met, which is one of the conditions that trigger the abnormal disk processing flow.

When an IO issued to a disk is detected as not having returned after more than 30 seconds, the disk is marked as abnormal and OSD down is triggered to achieve software isolation.
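The two thresholds and the multi-disk condition can be sketched as follows. The numeric values (5 seconds, 30 seconds, at least 2 disks) come from the text; the state names and function names are hypothetical, and treating a disk past the 30-second threshold as also counting toward the multi-disk condition is an assumption.

```python
IO_TIMEOUT_S = 5       # per-disk IO timeout from the text
SOFT_ISOLATE_S = 30    # threshold for OSD down (software isolation) from the text
MULTI_DISK_MIN = 2     # minimum number of timed-out disks for the multi-disk condition


def classify(pending_seconds):
    """Map the age of a disk's oldest outstanding IO to a state name (names are hypothetical)."""
    if pending_seconds > SOFT_ISOLATE_S:
        return "osd_down"
    if pending_seconds > IO_TIMEOUT_S:
        return "io_timeout"
    return "ok"


def multi_disk_timeout(pending_by_disk):
    """True when at least MULTI_DISK_MIN disks are past the IO timeout."""
    timed_out = [d for d, s in pending_by_disk.items() if classify(s) != "ok"]
    return len(timed_out) >= MULTI_DISK_MIN
```

Both comparisons are strict, matching the wording "more than 5 seconds" and "more than 30 seconds".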

The health state of the hard disk is monitored through the smartctl tool, another active polling method with equivalent functionality, or the asynchronous event reporting mechanism of the RAID card, to determine whether the hard disk has an abnormality such as medium error, hardware error, not ready, or IO timeout. If it does, the hard disk is marked as an abnormal disk, which is one of the conditions that trigger the abnormal disk processing flow.
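The text does not say how these abnormal states are encoded. Three of them (not ready, medium error, hardware error) share their names with standard SCSI sense keys, so one possible reading, offered here purely as an assumption, is that the polling result is a sense key. The sketch below maps sense keys to the state names used in the text; the function name health_state and the io_timed_out flag are hypothetical.

```python
# Sense key values as defined by the SCSI Primary Commands standard.
SENSE_KEY_STATES = {
    0x2: "not ready",
    0x3: "medium error",
    0x4: "hardware error",
}


def health_state(sense_key, io_timed_out=False):
    """Return the abnormal state name, or None when nothing abnormal is reported."""
    state = SENSE_KEY_STATES.get(sense_key)
    if state is not None:
        return state
    # IO timeout is not a sense key; it is passed in separately (assumption).
    return "io timeout" if io_timed_out else None
```

A disk would then be marked as an abnormal disk whenever health_state returns anything other than None.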

When the RAID card firmware (FW) finds that a hard disk has an abnormality (including state abnormalities and IO abnormalities), it reports the related abnormal events to the operating system; capturing these events through the RAID card driver makes it possible to identify the real-time health state of the hard disk.

When the module detects that multi-disk IO blocking and a hard disk state abnormality occur at the same time, the abnormality processing flow is started.

At this point, the ipmitool tool or another software module with the required functionality is used to power off the abnormal hard disk, which promptly prevents the abnormality from continuing to spread.

Taking ipmitool as an example, a hard disk power-off command can be sent to the BMC from user mode; after receiving the command, the BMC controls the backplane CPLD through the I2C bus to power off the designated slot.
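The text does not give the actual command bytes. Slot power control is typically a vendor-specific OEM command, so the sketch below only builds an argument list in the general `ipmitool raw <netfn> <cmd> [data]` form without running it. The NetFn and command values are placeholders, not real values for any BMC, and the one-byte slot encoding and the function name build_power_off_argv are assumptions.

```python
# Placeholder values: the real NetFn, command, and data bytes are vendor-specific
# OEM values that are NOT given in the text.
HYPOTHETICAL_NETFN = "0x30"
HYPOTHETICAL_CMD = "0x00"


def build_power_off_argv(slot, netfn=HYPOTHETICAL_NETFN, cmd=HYPOTHETICAL_CMD):
    """Build (but do not run) an `ipmitool raw` argument list for one backplane slot."""
    if not 0 <= slot <= 0xFF:
        raise ValueError("slot must fit in one byte")
    # The slot number is encoded as a single hex data byte (assumption).
    return ["ipmitool", "raw", netfn, cmd, f"0x{slot:02x}"]
```

On a real system the list would be handed to a process launcher such as subprocess.run once the placeholders are replaced with the values documented for the specific BMC.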

In an embodiment, the present disclosure also provides a hard disk management apparatus, as shown in FIG. 2, applied to a hard disk management device, the apparatus including: an IO module 21, configured to obtain the IO state of the monitored hard disks and filter to identify the hard disks that have an IO timeout; a health module 22, configured to acquire hard disk health state information for the hard disks that have an IO timeout and, according to that information, identify the hard disks that have both abnormal health state information and an IO timeout; and a power-off module 23, configured to send a power-off command, where the power-off command includes an instruction to power off the hard disks that have both abnormal health state information and an IO timeout.

The device embodiments are the same or similar to the corresponding method embodiments and are not described herein again.

In an embodiment, the present disclosure provides an electronic device including a processor and a machine-readable storage medium. The machine-readable storage medium stores machine-executable instructions executable by the processor, and the processor executes the machine-executable instructions to implement the foregoing hard disk management method. At the hardware level, a schematic diagram of the hardware architecture may be as shown in FIG. 3.

In one embodiment, the present disclosure provides a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the aforementioned hard disk management method.

Here, a machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and so forth. For example, the machine-readable storage medium may be: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.

The systems, devices, modules or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the various elements may be implemented in the same one or more software and/or hardware implementations in practicing the disclosure.

As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Furthermore, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only an embodiment of the present disclosure, and is not intended to limit the present disclosure. Various modifications and variations of this disclosure will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the scope of the claims of the present disclosure.

Claims

1. A hard disk management method, applied to a hard disk management device, the method comprising:

obtaining the IO state of the monitored hard disks, and filtering to identify the hard disks that have an IO timeout;

acquiring hard disk health state information for the hard disks that have an IO timeout and, according to that information, identifying the hard disks that have both abnormal health state information and an IO timeout;

and sending a power-off command, wherein the power-off command comprises an instruction to power off the hard disks that have both abnormal health state information and an IO timeout.

2. The method of claim 1, wherein sending the power-off command, the power-off command comprising an instruction to power off the hard disks that have both abnormal health state information and an IO timeout, comprises:

sending a hard disk power-off command to the BMC, so that the BMC, according to the hard disk power-off command, controls the CPLD of the hard disk backplane through the I2C bus to power off the hard disk slot associated with the hard disk that has both abnormal health state information and an IO timeout.

3. The method of claim 1, wherein obtaining the IO state of the monitored hard disks and filtering to identify the hard disks that have an IO timeout comprises:

timing how long each IO takes to return after it is issued to a hard disk and, if the IO does not return within a preset duration, determining that the hard disk has an IO timeout.

4. The method of claim 1, wherein acquiring the hard disk health state information for the hard disks that have an IO timeout, and identifying the hard disks that have both abnormal health state information and an IO timeout according to that information, comprises:

monitoring the hard disk health state information of the hard disk by polling;

wherein the hard disk health state information comprises medium error, hardware error, not ready, and IO timeout.

5. A hard disk management apparatus, applied to a hard disk management device, the apparatus comprising:

an IO module, configured to obtain the IO state of the monitored hard disks, and filter to identify the hard disks that have an IO timeout;

a health module, configured to acquire hard disk health state information for the hard disks that have an IO timeout and, according to that information, identify the hard disks that have both abnormal health state information and an IO timeout;

and a power-off module, configured to send a power-off command, wherein the power-off command comprises an instruction to power off the hard disks that have both abnormal health state information and an IO timeout.

6. The apparatus of claim 5, wherein the sending a power down command, the power down command including an instruction to power down a hard disk with abnormal health status information and IO timeout, comprises:

7. The apparatus of claim 5, wherein the obtaining the IO status of the monitored hard disk, and filtering to find the hard disk with IO timeout comprises:

8. The apparatus according to claim 5, wherein the obtaining the health status information of the hard disk with IO timeout, and finding out the hard disk with abnormal health status information and IO timeout according to the health status information of the hard disk comprises:

polling and monitoring the hard disk health state information of the hard disk;

9. An electronic device, comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor to perform the method of any one of claims 1 to 4.

10. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any of claims 1-4.