CN115793963A - Hard disk fault processing method, device, equipment and storage medium

Publication number
CN115793963A
Authority
CN
China
Prior art keywords
hard disk
target
parameter
determining
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211348897.8A
Other languages
Chinese (zh)
Inventor
陈远喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Data Technology Co Ltd
Original Assignee
Jinan Inspur Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The application relates to the technical field of storage and discloses a hard disk fault processing method, apparatus, device and storage medium. The method comprises the following steps: detecting the input and output (IO) states of each hard disk in the storage system, and determining a target hard disk with abnormal input and output states according to the detection result; acquiring a repair instruction issued to the target hard disk by a small computer system interface (SCSI) driver, and determining hard disk parameters of the target hard disk according to the hard disk address corresponding to the repair instruction; and generating a bottom layer command based on the hard disk parameters of the target hard disk, and executing the bottom layer command to disconnect the physical layer port connection of the target hard disk. By monitoring both the IO state of the hard disks and the instructions transmitted by the SCSI driver, the application identifies a faulty hard disk quickly and accurately, and closes it through a hardware bottom layer command so that it is removed from the system. This solves the problem that server IO is blocked by a hard disk fault, and avoids degrading the performance of the whole cluster because of a problem with a single hard disk in the storage system.

Description

Hard disk fault processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of storage technologies, and in particular, to a hard disk failure processing method, apparatus, device, and storage medium.
Background
A storage server usually uses SAS or SATA hard disks as its main data storage disks, and these hard disks and the operating system rely on the SCSI (Small Computer System Interface) protocol to transfer commands, status and block data. Among the various storage technologies, the SCSI protocol is widely used and is regarded as one of the most important backbones.
However, the existing SCSI protocol is not perfect. The main problem is as follows: before the SCSI error handling flow is entered, the Host state is set to the recovery state (in the SCSI protocol, the Host usually enters recovery mode because of an error or a failure). From then on the Host is in a blocking state, and any IO sent to the Host is blocked until the error handler completes. The purpose of this design is that, when an IO times out or fails, the kernel does not know the specific cause, and the problem may lie in the Host itself; if the Host has a problem and the whole Host is not blocked, erroneous IO may continue to be issued and accumulate, so that the error handler can never complete. As a consequence of this design, however, when one hard disk under the Host has a problem, the IO of the other hard disks is also blocked until the error handler completes. This can block or delay the IO of the entire system, and in severe cases (when the number of disks under the Host is large, as in a disk array or expander scenario) the IO of the system may be paralyzed. The problem is very serious in some application scenarios, such as distributed storage, because a problem with one hard disk may degrade the performance of the whole cluster, interrupt services, or even make the interruption unrecoverable. As capacity requirements for stored data grow, the number of hard disks increases and hard disk failures are difficult to avoid; besides improving the quality of the hardware, an effective method is needed to solve the problem of IO blocking.
Therefore, the above technical problems need to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a hard disk failure processing method, apparatus, device and storage medium, which can avoid performance degradation of the whole cluster caused by a hard disk problem in a storage system. The specific scheme is as follows:
a first aspect of the present application provides a hard disk failure processing method, including:
detecting the input and output states of each hard disk in the storage system, and determining a target hard disk with abnormal input and output states according to the detection result;
acquiring a repair instruction issued to the target hard disk by a small computer system interface driver, and determining hard disk parameters of the target hard disk according to a hard disk address corresponding to the repair instruction;
and generating a bottom layer command based on the hard disk parameters of the target hard disk, and executing the bottom layer command to disconnect the physical layer port connection of the target hard disk.
Optionally, the detecting the input/output state of each hard disk in the storage system, and determining a target hard disk with an abnormal input/output state according to the detection result includes:
detecting parameter values of combined parameters reflecting input and output states of each hard disk in a storage system; the combination parameters are a first combination parameter containing service time and occupancy rate, a second combination parameter containing delay time and bandwidth, or a third combination parameter representing command issuing overtime time;
and if the parameter values of the combined parameters meet corresponding preset conditions, judging that the corresponding hard disk is the target hard disk with abnormal input and output states.
Optionally, if the parameter value of the combination parameter satisfies a corresponding preset condition, the method includes:
if the service time in the first combination parameter is not less than a first threshold value and the occupancy rate is not less than a second threshold value, judging that the parameter value of the first combination parameter meets a corresponding preset condition;
if the delay time in the second combination parameter is not less than a third threshold and the bandwidth is not greater than a fourth threshold, determining that the parameter value of the second combination parameter meets a corresponding preset condition;
and if the command issuing timeout time in the third combination parameter is greater than a fifth threshold, determining that the parameter value of the third combination parameter meets a corresponding preset condition.
Optionally, in the hard disk failure processing method, a mapping relationship exists between a hard disk parameter and a hard disk address of each hard disk in the storage system, and the mapping relationship is stored in a form of a mapping file, so as to determine the hard disk parameter of the target hard disk according to the mapping file.
Optionally, the determining the hard disk parameter of the target hard disk according to the hard disk address corresponding to the repair instruction includes:
and extracting a corresponding hard disk address from the repair instruction, and determining a hard disk parameter corresponding to the hard disk address according to the mapping relation stored in the mapping file.
Optionally, after the executing the bottom layer command to disconnect the physical layer port of the target hard disk, the method further includes:
and judging whether the target hard disk is removed from the storage system, and if so, recovering the physical layer port connection of the target hard disk.
Optionally, after generating the bottom layer command based on the hard disk parameter of the target hard disk, the method further includes:
and setting overtime time for the bottom layer command, and if the physical layer port connection of the target hard disk is not disconnected within the overtime time, forcibly executing an abort command on the target hard disk.
A second aspect of the present application provides a hard disk failure processing apparatus, including:
the state detection module is used for detecting the input and output states of each hard disk in the storage system and determining a target hard disk with abnormal input and output states according to the detection result;
the parameter determining module is used for acquiring a repair instruction issued to the target hard disk by a small computer system interface driver and determining the hard disk parameters of the target hard disk according to the hard disk address corresponding to the repair instruction;
and the command generating and executing module is used for generating a bottom layer command based on the hard disk parameters of the target hard disk and executing the bottom layer command to disconnect the physical layer port of the target hard disk.
A third aspect of the present application provides an electronic device comprising a processor and a memory; the memory is used for storing a computer program, and the computer program is loaded and executed by the processor to realize the hard disk failure processing method.
A fourth aspect of the present application provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are loaded and executed by a processor, the foregoing hard disk failure processing method is implemented.
In the present application, the input and output states of each hard disk in a storage system are first detected, and a target hard disk with abnormal input and output states is determined according to the detection result; a repair instruction issued to the target hard disk by a small computer system interface driver is then obtained, and hard disk parameters of the target hard disk are determined according to the hard disk address corresponding to the repair instruction; finally, a bottom layer command is generated based on the hard disk parameters of the target hard disk and executed to disconnect the physical layer port connection of the target hard disk. By monitoring both the IO state of the hard disks and the instructions transmitted by the SCSI driver, the application identifies a failed hard disk quickly and accurately, and closes it through a hardware bottom layer command so that it is removed from the system. This solves the problem of server IO blocking caused by a hard disk failure and avoids degrading the performance of the whole cluster because of a problem with a single hard disk in the storage system. The problem is addressed at its root cause and prevented from spreading, which improves the stability of the whole system and keeps the storage service working normally.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a hard disk failure processing method provided in the present application;
Fig. 2 is a flowchart of a specific hard disk failure processing method provided in the present application;
Fig. 3 is a flowchart of a specific slow disk detection method provided in the present application;
Fig. 4 is a schematic structural diagram of a hard disk failure processing apparatus according to the present application;
Fig. 5 is a structural diagram of an electronic device for processing hard disk failure according to the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In the existing SCSI protocol, the Host state is set to the recovery state before the SCSI error handling flow is entered; the Host is then in a blocking state, and any IO sent to the Host is blocked until the error handler completes. Because of this design, when one hard disk under the Host has a problem, the IO of the other hard disks is blocked until the error handler completes, which can block or delay the IO of the whole system and, in severe cases, paralyze it. As capacity requirements for stored data grow, the number of hard disks increases and hard disk failures are difficult to avoid; besides improving the quality of the hardware, an effective method is needed to solve the problem of IO blocking. To overcome these technical defects, the application provides a hard disk failure processing scheme that identifies a failed hard disk quickly and accurately by monitoring the IO state of the hard disks and the instructions transmitted by the SCSI driver, and closes the failed hard disk through a hardware bottom layer command so that it is removed from the system. This solves the problem of server IO blocking caused by a hard disk failure and avoids degrading the performance of the whole cluster because of a problem with a single hard disk in the storage system.
Fig. 1 is a flowchart of a hard disk failure processing method according to an embodiment of the present application. Referring to fig. 1, the hard disk failure processing method includes:
S11: detecting the input and output states of each hard disk in the storage system, and determining a target hard disk with abnormal input and output states according to the detection result.
In this embodiment, the input/output state of each hard disk in the storage system is detected first, and a target hard disk with an abnormal input/output state is determined according to the detection result. The storage system generally comprises a plurality of hard disks including but not limited to SAS hard disks, SATA hard disks, etc., and IO status of each hard disk is detected, and then a hard disk with abnormal IO, that is, the target hard disk, is determined according to a detection result.
S12: acquiring a repair instruction issued to the target hard disk by a small computer system interface driver, and determining the hard disk parameters of the target hard disk according to the hard disk address corresponding to the repair instruction.
In this embodiment, when an IO exception exists, monitoring of the instructions transmitted by the SCSI driver is started. A repair instruction issued to the target hard disk by the small computer system interface driver is then acquired, and the hard disk parameters of the target hard disk are determined according to the hard disk address corresponding to the repair instruction. That is, when the SCSI driver issues a reset (RESET) repair instruction to the hard disk with abnormal IO, the SCSI address corresponding to that instruction is used to find the relevant information of the failed hard disk, such as channel id, device id, index and phy id. This information is stored in a mapping file in advance, which improves the efficiency of kicking the disk out of the system.
Specifically, in the storage system of the present embodiment, a mapping relationship exists between a hard disk parameter of each hard disk and a hard disk address, and the mapping relationship is stored in the form of a mapping file, so as to determine the hard disk parameter of the target hard disk according to the mapping file. When determining the hard disk parameters, extracting the corresponding hard disk address from the repair instruction, and determining the hard disk parameters corresponding to the hard disk address according to the mapping relation stored in the mapping file.
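The lookup described above can be sketched as follows. This is a minimal illustration, not the format used by the application: the JSON layout of the mapping file, the "host:channel:target:lun" form of the SCSI address, and the names `DiskParams`, `load_mapping` and `lookup` are all assumptions made for the example. Only the four parameter names (channel id, device id, index, phy id) come from the text.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DiskParams:
    """Hard disk parameters named in the text; the field layout is an assumption."""
    channel_id: int
    device_id: int
    index: int
    phy_id: int


def load_mapping(text: str) -> Dict[str, DiskParams]:
    """Parse a mapping file (assumed to be JSON) from SCSI address to parameters."""
    raw = json.loads(text)
    return {addr: DiskParams(**fields) for addr, fields in raw.items()}


def lookup(mapping: Dict[str, DiskParams], scsi_address: str) -> Optional[DiskParams]:
    """Return the parameters for the address extracted from a repair instruction."""
    return mapping.get(scsi_address)


# Hypothetical mapping file content, written in advance for every disk.
EXAMPLE_MAPPING = """
{
  "0:0:3:0": {"channel_id": 0, "device_id": 3, "index": 1, "phy_id": 7},
  "0:0:4:0": {"channel_id": 0, "device_id": 4, "index": 1, "phy_id": 8}
}
"""

mapping = load_mapping(EXAMPLE_MAPPING)
target = lookup(mapping, "0:0:3:0")  # address taken from the reset instruction
```

Because the file is built before any fault occurs, resolving the parameters at fault time is a single dictionary lookup, which is consistent with the stated goal of speeding up removal of the disk.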
S13: generating a bottom layer command based on the hard disk parameters of the target hard disk, and executing the bottom layer command to disconnect the physical layer port connection of the target hard disk.
In this embodiment, after the failed hard disk is determined, a bottom layer command is generated based on the hard disk parameters of the target hard disk, and the bottom layer command is executed to disconnect the physical layer port connection of the target hard disk. It can be understood that if a hard disk has an IO exception and a reset operation is attempted on that hard disk, a failed hard disk has appeared and the Host of the system enters recovery mode; if the Host does not exit recovery mode quickly, IO blocking may occur, so the failed hard disk needs to be removed from the system.
In this embodiment, the bottom layer command is used directly to close the phy of the abnormal hard disk and kick the hard disk out of the system, which resolves the IO blocking quickly and accurately at its root cause and prevents the fault from spreading. Specifically, the hardware bottom layer commands of the SAS/RAID card can be called to close the phy corresponding to the hard disk with abnormal IO, thereby kicking the hard disk out of the system. Depending on the SAS/RAID card used, the present embodiment may use the following commands:
arcconf setstate 1 device <channel id> <device id> DDD noprompt
scrtnycli.x86_64 -i <index> phy -off <phy id>
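Generating the bottom layer command from the hard disk parameters can be sketched as below. The argument order follows the two example commands in the text, with the spacing and option spelling (`-i`, `-off`) as reconstructed here; these should be checked against the vendor tool before use. The function names, the `tool` selector and the default controller number 1 (taken from the example) are assumptions for illustration.

```python
from typing import Dict, List


def build_arcconf_command(channel_id: int, device_id: int, controller: int = 1) -> List[str]:
    """argv modeled on the arcconf example in the text (sets the device to DDD)."""
    return ["arcconf", "setstate", str(controller), "device",
            str(channel_id), str(device_id), "DDD", "noprompt"]


def build_scrtnycli_command(index: int, phy_id: int) -> List[str]:
    """argv modeled on the scrtnycli example in the text (turns the phy off)."""
    return ["scrtnycli.x86_64", "-i", str(index), "phy", "-off", str(phy_id)]


def build_bottom_layer_command(tool: str, params: Dict[str, int]) -> List[str]:
    """Pick the command template that matches the SAS/RAID card's tool."""
    if tool == "arcconf":
        return build_arcconf_command(params["channel_id"], params["device_id"])
    if tool == "scrtnycli":
        return build_scrtnycli_command(params["index"], params["phy_id"])
    raise ValueError(f"unsupported tool: {tool}")


# Hypothetical parameters, as they might be read from the mapping file.
params = {"channel_id": 0, "device_id": 3, "index": 1, "phy_id": 7}
print(" ".join(build_bottom_layer_command("arcconf", params)))
print(" ".join(build_bottom_layer_command("scrtnycli", params)))
```

Building the command as an argument list rather than a shell string avoids quoting problems when it is later handed to a process launcher.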
It can be seen that, in the embodiment of the application, the input and output states of each hard disk in the storage system are first detected, and the target hard disk with abnormal input and output states is determined according to the detection result; a repair instruction issued to the target hard disk by a small computer system interface driver is then obtained, and hard disk parameters of the target hard disk are determined according to the hard disk address corresponding to the repair instruction; finally, a bottom layer command is generated based on the hard disk parameters of the target hard disk and executed to disconnect the physical layer port connection of the target hard disk. By monitoring both the IO state of the hard disks and the instructions transmitted by the SCSI driver, the embodiment identifies a failed hard disk quickly and accurately, and closes it through a hardware bottom layer command so that it is removed from the system. This solves the problem of server IO blocking caused by a hard disk failure and avoids degrading the performance of the whole cluster because of a problem with a single hard disk in the storage system. The problem is addressed at its root cause and prevented from spreading, which improves the stability of the whole system and keeps the storage service working normally.
Fig. 2 is a flowchart of a specific hard disk failure processing method according to an embodiment of the present application. Referring to fig. 2, the hard disk failure processing method includes:
S21: detecting parameter values of combination parameters reflecting the input and output states of each hard disk in a storage system; the combination parameter is a first combination parameter including service time and occupancy rate, a second combination parameter including delay time and bandwidth, or a third combination parameter representing command issuing timeout time.
In this embodiment, the IO state is detected by detecting the parameter values of combination parameters that reflect the input and output states of each hard disk in the storage system. This is a slow disk detection process: the latency of the IO operations performed on all hard disks is monitored, that is, the IO state of each hard disk is monitored. Three combination parameters are mainly used: a first combination parameter containing service time and occupancy rate, a second combination parameter containing delay time and bandwidth, and a third combination parameter representing command issuing timeout time.
S22: if the parameter values of the combination parameters meet the corresponding preset conditions, judging that the corresponding hard disk is the target hard disk with abnormal input and output states.
In this embodiment, if the parameter value of the combination parameter satisfies the corresponding preset condition, it is determined that the corresponding hard disk is the target hard disk with an abnormal input/output state. This specifically comprises the following steps (as shown in Fig. 3):
S221: if the service time in the first combination parameter is not less than a first threshold value and the occupancy rate is not less than a second threshold value, judging that the parameter value of the first combination parameter meets the corresponding preset condition.
S222: if the delay time in the second combination parameter is not less than a third threshold and the bandwidth is not more than a fourth threshold, determining that the parameter value of the second combination parameter meets the corresponding preset condition.
S223: if the command issuing timeout time in the third combination parameter is greater than a fifth threshold, determining that the parameter value of the third combination parameter meets the corresponding preset condition.
In this embodiment, for the first combination parameter, if the service time in the first combination parameter is not less than a first threshold and the occupancy rate in the first combination parameter is not less than a second threshold, it is determined that a parameter value of the first combination parameter satisfies a corresponding preset condition; for the second combination parameter, if the delay time in the second combination parameter is not less than a third threshold and the bandwidth is not greater than a fourth threshold, determining that the parameter value of the second combination parameter satisfies a corresponding preset condition; for the third combination parameter, if the command issuance timeout time in the third combination parameter is greater than a fifth threshold, it is determined that a parameter value of the third combination parameter satisfies a corresponding preset condition.
Further, the present embodiment may set the conditions and thresholds of the slow disk standard with the aid of the iostat tool, for example:
(1) svctm >= 90 ms and min_util >= 50;
(2) await >= 600 ms and min_ops <= 50;
(3) IO command issue timeout > 10 s.
When any one of the above three conditions is satisfied, the IO of the hard disk is abnormal and the hard disk may be a slow disk.
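The three example conditions can be expressed as a small check. This is a sketch under stated assumptions: the thresholds are the example values from the text, while the `IoSample` structure, the reading of `min_util` as a utilization percentage and of `min_ops` as a throughput figure, and the function names are illustrative. How the values are collected is left open, since recent sysstat releases no longer report `svctm`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IoSample:
    """One observation for a single disk; field meanings are assumptions."""
    svctm_ms: float       # service time, in milliseconds
    util: float           # occupancy rate (min_util in the text)
    await_ms: float       # delay time, in milliseconds
    ops: float            # bandwidth figure (min_ops in the text)
    cmd_timeout_s: float  # command issuing timeout time, in seconds


def matched_conditions(s: IoSample) -> List[int]:
    """Return the numbers (1 to 3) of the example conditions that the sample meets."""
    matched = []
    if s.svctm_ms >= 90 and s.util >= 50:   # condition (1)
        matched.append(1)
    if s.await_ms >= 600 and s.ops <= 50:   # condition (2)
        matched.append(2)
    if s.cmd_timeout_s > 10:                # condition (3)
        matched.append(3)
    return matched


def is_slow_disk(s: IoSample) -> bool:
    """A disk is a slow disk candidate when any one condition is met."""
    return bool(matched_conditions(s))


healthy = IoSample(svctm_ms=5, util=20, await_ms=10, ops=400, cmd_timeout_s=0)
busy_and_slow = IoSample(svctm_ms=95, util=60, await_ms=10, ops=400, cmd_timeout_s=0)
```

Returning the list of matched conditions, rather than only a boolean, makes it possible to record in the log which rule flagged the disk.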
S23: acquiring a repair instruction issued to the target hard disk by a small computer system interface driver, and determining the hard disk parameters of the target hard disk according to the hard disk address corresponding to the repair instruction.
S24: generating a bottom layer command based on the hard disk parameters of the target hard disk, and executing the bottom layer command to disconnect the physical layer port connection of the target hard disk.
In this embodiment, for the specific processes of step S23 and step S24, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and details are not repeated here. It should be added that, in this embodiment, all IO operations of the SCSI driver can be monitored through a kernel module loaded in advance. When an IO abnormality is detected, all IO operations of the SCSI driver are monitored for 5 minutes.
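The 5-minute monitoring window can be sketched as a small bookkeeping class. The kernel module itself is not modeled; the sketch only shows how a reset instruction might be correlated with a disk that was flagged as abnormal within the window. The class and method names are hypothetical, and the clock value is passed in explicitly so that the logic can be exercised without waiting.

```python
from typing import Dict


class ResetMonitor:
    """Tracks, per SCSI address, until when reset instructions are being watched."""

    def __init__(self, window_s: float = 300.0) -> None:
        self.window_s = window_s  # 5 minutes, the value given in the text
        self._deadline: Dict[str, float] = {}

    def on_io_abnormal(self, scsi_address: str, now: float) -> None:
        """Open (or extend) the monitoring window for a disk with abnormal IO."""
        self._deadline[scsi_address] = now + self.window_s

    def on_reset_instruction(self, scsi_address: str, now: float) -> bool:
        """Return True if the reset targets a disk flagged within the window."""
        deadline = self._deadline.get(scsi_address)
        if deadline is None:
            return False
        if now > deadline:
            del self._deadline[scsi_address]  # window expired, stop watching
            return False
        return True


monitor = ResetMonitor()
monitor.on_io_abnormal("0:0:3:0", now=0.0)
```

In a running system `now` would typically come from `time.monotonic()`, which is not affected by wall clock adjustments.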
S25: judging whether the target hard disk has been removed from the storage system, and if so, restoring the physical layer port connection of the target hard disk.
In this embodiment, after the command to close the phy is executed, the failed hard disk is normally removed from the system, the related log is recorded and an alarm is raised. Finally, it is judged whether the target hard disk has been removed from the storage system, and if so, the physical layer port connection of the target hard disk is restored. That is, after the failed hard disk has been replaced with a new one, the link state of the phy that was closed is automatically restored.
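The removal check followed by phy restoration can be sketched as a bounded polling loop. The text does not say how removal is detected or which command restores the phy, so both are passed in as callables; `is_removed`, `restore_phy`, `wait` and `max_checks` are hypothetical names introduced for the example.

```python
from typing import Callable


def restore_phy_after_removal(
    is_removed: Callable[[], bool],
    restore_phy: Callable[[], None],
    max_checks: int = 10,
    wait: Callable[[], None] = lambda: None,
) -> bool:
    """Poll until the target disk is gone, then restore its phy connection.

    Returns True if the phy was restored, False if the disk was still
    present after max_checks polls.
    """
    for _ in range(max_checks):
        if is_removed():
            restore_phy()
            return True
        wait()  # e.g. lambda: time.sleep(1) in a real deployment
    return False


# Simulated removal: the disk disappears on the third poll.
states = iter([False, False, True])
actions = []
restored = restore_phy_after_removal(lambda: next(states), lambda: actions.append("restore"))
```

Restoring the phy only after the disk is confirmed absent keeps the faulty disk from rejoining the system, while leaving the slot usable for a replacement disk.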
In addition, in this embodiment, a timeout time may be set for the bottom layer command, and if the physical layer port connection of the target hard disk is not disconnected within the timeout time, an abort command is forcibly executed on the target hard disk. That is, a timeout is set for the close-phy command; if the hard disk is not removed from the system within the specified time, the command is forcibly terminated, which prevents problems caused by link or other abnormalities from spreading, and the related logs are recorded.
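A timeout on the bottom layer command can be sketched with the standard `subprocess` module. This approximates the forced abort described above by terminating the command process when the timeout expires; whether the application also sends a separate abort to the disk is not specified, so that part is not modeled. The function name and the returned status strings are assumptions.

```python
import logging
import subprocess
import sys
from typing import List

log = logging.getLogger("disk_fault")


def run_with_timeout(argv: List[str], timeout_s: float) -> str:
    """Run a command; return 'ok', 'failed', or 'aborted' on timeout."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        log.error("command %s exceeded %.1f s and was terminated", argv, timeout_s)
        return "aborted"
    except OSError as exc:
        # For example, the vendor tool is not installed on this host.
        log.error("command %s could not be started: %s", argv, exc)
        return "failed"
    if result.returncode != 0:
        log.error("command %s exited with code %d", argv, result.returncode)
        return "failed"
    log.info("command %s completed", argv)
    return "ok"


# Stand-in commands (the Python interpreter itself) instead of a real vendor tool.
quick = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=10.0)
stuck = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
```

Each outcome is logged, matching the requirement in the text that the related logs be recorded.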
Therefore, in the embodiment of the application, the parameter values of the combination parameters reflecting the input and output states of each hard disk in the storage system are detected; the combination parameter is a first combination parameter including service time and occupancy rate, a second combination parameter including delay time and bandwidth, or a third combination parameter representing command issuing timeout time. And if the parameter values of the combined parameters meet corresponding preset conditions, judging that the corresponding hard disk is the target hard disk with abnormal input and output states. Specifically, if the service time in the first combination parameter is not less than a first threshold and the occupancy rate is not less than a second threshold, it is determined that the parameter value of the first combination parameter satisfies a corresponding preset condition; if the delay time in the second combination parameter is not less than a third threshold and the bandwidth is not more than a fourth threshold, determining that the parameter value of the second combination parameter meets the corresponding preset condition; and if the command issuing timeout time in the third combination parameter is greater than a fifth threshold, determining that the parameter value of the third combination parameter meets the corresponding preset condition. On the basis, a repair instruction issued to the target hard disk by a small computer system interface driver is obtained, and hard disk parameters of the target hard disk are determined according to a hard disk address corresponding to the repair instruction. And generating a bottom layer command based on the hard disk parameters of the target hard disk, and executing the bottom layer command to disconnect the physical layer port connection of the target hard disk. 
Finally, it is judged whether the target hard disk has been removed from the storage system, and if so, the physical layer port connection of the target hard disk is restored. This solves the problem in storage products that a fault in a single hard disk degrades the performance of the whole cluster, interrupts services, or even makes the interruption unrecoverable, and it covers hard disk failure, which is the main trigger scenario of IO blocking.
Referring to fig. 4, an embodiment of the present application further discloses a hard disk failure processing apparatus, which includes:
the state detection module 11 is configured to detect input and output states of each hard disk in the storage system, and determine a target hard disk with an abnormal input and output state according to a detection result;
a parameter determining module 12, configured to obtain a repair instruction issued to the target hard disk by a small computer system interface driver, and determine a hard disk parameter of the target hard disk according to a hard disk address corresponding to the repair instruction;
and the command generating and executing module 13 is configured to generate a bottom layer command based on the hard disk parameters of the target hard disk, and execute the bottom layer command to disconnect a physical layer port of the target hard disk.
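The cooperation of the three modules can be sketched as follows. Each module is represented by a callable so that the wiring is visible without committing to any implementation; the class name, the callable signatures and the stand-in lambdas are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FaultProcessingApparatus:
    # Module 11: returns the SCSI addresses of disks with abnormal IO.
    detect_abnormal: Callable[[], List[str]]
    # Module 12: maps a SCSI address to hard disk parameters (None if unknown).
    resolve_params: Callable[[str], Optional[Dict[str, int]]]
    # Module 13: builds and runs the bottom layer command; True on success.
    execute_command: Callable[[Dict[str, int]], bool]

    def process(self, reset_addresses: List[str]) -> List[str]:
        """Return the addresses of the disks whose phy was closed."""
        abnormal = set(self.detect_abnormal())
        closed = []
        for addr in reset_addresses:
            if addr not in abnormal:
                continue  # a reset for a disk that was not flagged is ignored
            params = self.resolve_params(addr)
            if params is not None and self.execute_command(params):
                closed.append(addr)
        return closed


# Stand-in modules for illustration only.
apparatus = FaultProcessingApparatus(
    detect_abnormal=lambda: ["0:0:3:0"],
    resolve_params=lambda addr: {"index": 1, "phy_id": 7} if addr == "0:0:3:0" else None,
    execute_command=lambda params: True,
)
closed = apparatus.process(["0:0:3:0", "0:0:4:0"])
```

Only a disk that is both flagged as abnormal and targeted by a repair instruction reaches the command execution step, which mirrors the two-signal identification described for the method.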
In this embodiment, the state detection module 11 first detects the input/output state of each hard disk in the storage system, and determines a target hard disk with an abnormal input/output state according to the detection result. The storage system generally comprises a plurality of hard disks, including but not limited to SAS hard disks, SATA hard disks, etc., and the IO state of each hard disk is detected, and then the hard disk with abnormal IO, that is, the target hard disk, is determined according to the detection result.
In this embodiment, when an IO exception exists, monitoring of the instructions transmitted by the SCSI driver is started. The parameter determining module 12 then obtains a repair instruction issued to the target hard disk by the small computer system interface driver, and determines the hard disk parameters of the target hard disk according to the hard disk address corresponding to the repair instruction. That is, when the SCSI driver issues a reset (RESET) repair instruction to the hard disk with abnormal IO, the SCSI address corresponding to that instruction is used to find the relevant information of the failed hard disk, such as channel id, device id, index and phy id. This information is stored in a mapping file in advance, which improves the efficiency of kicking the disk out of the system.
Specifically, the hard disk parameters and the hard disk addresses of the hard disks in the storage system of this embodiment have mapping relationships, and the mapping relationships are stored in the form of mapping files, so as to determine the hard disk parameters of the target hard disk according to the mapping files. When determining the hard disk parameters, extracting the corresponding hard disk address from the repair instruction, and determining the hard disk parameters corresponding to the hard disk address according to the mapping relation stored in the mapping file.
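The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the format of the mapping file, so a JSON object keyed by a SCSI address string of the form host:channel:target:lun is assumed here, and the names DiskParams, load_mapping and params_for_reset are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DiskParams:
    """Hard disk parameters named in the text (field names are assumptions)."""
    channel_id: int
    device_id: int
    index: int
    phy_id: int


def load_mapping(text: str) -> Dict[str, DiskParams]:
    """Parse the assumed JSON mapping file: SCSI address -> disk parameters."""
    raw = json.loads(text)
    return {addr: DiskParams(**fields) for addr, fields in raw.items()}


def params_for_reset(mapping: Dict[str, DiskParams],
                     scsi_address: str) -> Optional[DiskParams]:
    """Return the parameters for the address taken from a repair instruction,
    or None when the address is not present in the mapping."""
    return mapping.get(scsi_address)


# Hypothetical mapping file content with a single disk.
MAPPING_TEXT = (
    '{"0:0:3:0": {"channel_id": 0, "device_id": 3, "index": 1, "phy_id": 7}}'
)
mapping = load_mapping(MAPPING_TEXT)
```

Building the mapping once and keeping it in memory matches the stated goal of avoiding a slow discovery step at the moment the disk has to be kicked.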
In this embodiment, after the failed hard disk is determined, the command generating and executing module 13 generates a bottom layer command based on the hard disk parameters of the target hard disk, and executes the bottom layer command to disconnect the physical layer port of the target hard disk. It can be understood that an IO exception on a hard disk followed by an attempted reset operation on that disk indicates that the disk has failed. In this case the Host of the system enters a recovery mode, and IO blocking may occur if the Host cannot exit that mode quickly, so the failed hard disk needs to be removed from the system.
In this embodiment, a bottom layer command is used directly to close the phy of the abnormal hard disk and kick the hard disk out of the system, which quickly and accurately resolves the IO blocking at its root cause and prevents the fault from spreading. Specifically, the relevant hardware bottom layer commands of the SAS/RAID card can be called to close the phy corresponding to the hard disk with abnormal IO, so that the hard disk is kicked out of the system. Depending on the SAS/RAID card used, the present embodiment may use the following commands:
arcconf setstate 1 device <channel id> <device id> DDD noprompt
scrtnycli.x86_64 -I <index> phy -off <phy id>
Therefore, in this embodiment the input and output states of each hard disk in the storage system are detected, and the target hard disk with abnormal input and output states is determined according to the detection result; then, a repair instruction issued to the target hard disk by the small computer system interface driver is obtained, and the hard disk parameters of the target hard disk are determined according to the hard disk address corresponding to the repair instruction; finally, a bottom layer command is generated based on the hard disk parameters of the target hard disk and executed to disconnect the physical layer port connection of the target hard disk. By detecting the IO state of the hard disks and the instructions transmitted by the SCSI driver, the embodiment of the application identifies the failed hard disk quickly and accurately, and removes it by closing it through a hardware bottom layer command. This solves the problem of server IO blocking caused by hard disk failures and prevents a single damaged hard disk in the storage system from degrading the performance of the whole cluster. The problem is solved at its root cause and prevented from spreading, which improves the stability of the whole system and keeps the storage service working normally.
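As a rough sketch of the command generation step, the function below fills the two command templates quoted in this embodiment with the hard disk parameters. The function name build_kick_command, the card_type selector and the default controller number are assumptions made for illustration; the command tokens are copied from the text as quoted and have not been verified against the vendor tools.

```python
from typing import List


def build_kick_command(card_type: str, channel_id: int, device_id: int,
                       index: int, phy_id: int, controller: int = 1) -> List[str]:
    """Build the argument list for the bottom layer command.

    card_type is a hypothetical selector: "arcconf" or "scrtnycli".
    """
    if card_type == "arcconf":
        # arcconf setstate 1 device <channel id> <device id> DDD noprompt
        return ["arcconf", "setstate", str(controller), "device",
                str(channel_id), str(device_id), "DDD", "noprompt"]
    if card_type == "scrtnycli":
        # scrtnycli.x86_64 -I <index> phy -off <phy id>  (flags as quoted in the text)
        return ["scrtnycli.x86_64", "-I", str(index), "phy", "-off", str(phy_id)]
    raise ValueError(f"unsupported card type: {card_type}")
```

Returning an argument list rather than a single string lets the caller pass it to a process launcher without shell quoting; actually running the command is left out of the sketch because it acts on real hardware.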
In some specific embodiments, the state detection module 11 specifically includes:
a parameter detection unit for detecting parameter values of combination parameters reflecting input and output states of each hard disk in the storage system; the combination parameters are a first combination parameter containing service time and occupancy rate, a second combination parameter containing delay time and bandwidth, or a third combination parameter representing command issuance timeout time;
and the state judgment unit is used for judging that the corresponding hard disk is the target hard disk with abnormal input and output states if the parameter values of the combined parameters meet corresponding preset conditions.
In some embodiments, the state determining unit specifically includes:
a first state judgment subunit, configured to judge that a parameter value of the first combination parameter satisfies a corresponding preset condition if the service time in the first combination parameter is not less than a first threshold and the occupancy rate is not less than a second threshold;
a second state judgment subunit, configured to judge that a parameter value of the second combination parameter satisfies a corresponding preset condition if the delay time in the second combination parameter is not less than a third threshold and the bandwidth is not greater than a fourth threshold;
and the third state judgment subunit is configured to judge that the parameter value of the third combination parameter satisfies a corresponding preset condition if the command issuance timeout time in the third combination parameter is greater than a fifth threshold.
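The three judgment rules above can be sketched as a single predicate. The units, field names and the way the five thresholds are passed in are assumptions for illustration only, and the sketch reads the "or" between the combination parameters as meaning that any one satisfied condition marks the disk as abnormal.

```python
from dataclasses import dataclass


@dataclass
class IoSample:
    """One measurement for a hard disk (units are assumptions)."""
    service_time_ms: float
    occupancy_pct: float
    delay_ms: float
    bandwidth_mb_s: float
    cmd_timeout_s: float


@dataclass
class Thresholds:
    """The first to fifth thresholds named in the text."""
    first: float   # service time
    second: float  # occupancy rate
    third: float   # delay time
    fourth: float  # bandwidth
    fifth: float   # command issuance timeout time


def is_io_abnormal(s: IoSample, t: Thresholds) -> bool:
    # First combination: service time and occupancy both "not less than".
    first = s.service_time_ms >= t.first and s.occupancy_pct >= t.second
    # Second combination: delay "not less than", bandwidth "not greater than".
    second = s.delay_ms >= t.third and s.bandwidth_mb_s <= t.fourth
    # Third combination: timeout strictly "greater than".
    third = s.cmd_timeout_s > t.fifth
    return first or second or third
```

Note the asymmetry taken from the text: the first four comparisons are inclusive, while the timeout comparison is strict.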
In some embodiments, there is a mapping relationship between a hard disk parameter and a hard disk address of each hard disk in a storage system in the hard disk failure processing apparatus, and the mapping relationship is stored in the form of a mapping file, so as to determine a hard disk parameter of the target hard disk according to the mapping file.
In some specific embodiments, the parameter determining module 12 is specifically configured to extract a corresponding hard disk address from the repair instruction, and determine a hard disk parameter corresponding to the hard disk address according to the mapping relationship stored in the mapping file.
In some specific embodiments, the hard disk failure processing apparatus further includes:
the connection recovery module is used for judging whether the target hard disk is removed from the storage system or not, and if so, recovering the physical layer port connection of the target hard disk;
and the forced abort module is used for setting a timeout time for the bottom layer command, and forcibly executing an abort command on the target hard disk if the physical layer port connection of the target hard disk is not disconnected within the timeout time.
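A minimal sketch of the behaviour of these two modules is given below. All hardware actions are passed in as callables (run_command, phy_is_down, force_abort, disk_removed, enable_phy), which are hypothetical names, so the control flow can be read and exercised without a SAS/RAID card. The polling interval is also an assumption.

```python
import time
from typing import Callable


def kick_with_timeout(run_command: Callable[[], None],
                      phy_is_down: Callable[[], bool],
                      force_abort: Callable[[], None],
                      timeout_s: float,
                      poll_s: float = 0.1) -> str:
    """Issue the bottom layer command, wait up to timeout_s for the phy to
    go down, and fall back to a forced abort if it does not."""
    run_command()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if phy_is_down():
            return "disconnected"
        time.sleep(poll_s)
    if phy_is_down():  # final check once the timeout has elapsed
        return "disconnected"
    force_abort()
    return "aborted"


def restore_if_removed(disk_removed: Callable[[], bool],
                       enable_phy: Callable[[], None]) -> bool:
    """Restore the physical layer port connection once the failed disk has
    been removed from the storage system."""
    if disk_removed():
        enable_phy()
        return True
    return False
```

Restoring the port only after the disk is confirmed removed means a replacement disk inserted into the same slot can be discovered, while the failed disk cannot rejoin the system in the meantime.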
Furthermore, the embodiment of the application also provides electronic equipment. Fig. 5 is a block diagram of electronic device 20 shown in accordance with an exemplary embodiment, and the contents of the diagram should not be construed as limiting the scope of use of the present application in any way.
Fig. 5 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is adapted to store a computer program, which is loaded and executed by the processor 21, to implement at least the following steps:
detecting the input and output states of each hard disk in the storage system, and determining a target hard disk with abnormal input and output states according to the detection result;
acquiring a repair instruction issued to the target hard disk by a small computer system interface driver, and determining hard disk parameters of the target hard disk according to a hard disk address corresponding to the repair instruction;
and generating a bottom layer command based on the hard disk parameters of the target hard disk, and executing the bottom layer command to disconnect the physical layer port connection of the target hard disk.
Optionally, the detecting the input/output state of each hard disk in the storage system, and determining a target hard disk with an abnormal input/output state according to the detection result includes:
detecting parameter values of combination parameters reflecting input and output states of each hard disk in a storage system; the combination parameters are a first combination parameter containing service time and occupancy rate, a second combination parameter containing delay time and bandwidth, or a third combination parameter representing command issuance timeout time;
and if the parameter values of the combined parameters meet corresponding preset conditions, judging that the corresponding hard disk is the target hard disk with abnormal input and output states.
Optionally, if the parameter value of the combination parameter satisfies a corresponding preset condition, the method includes:
if the service time in the first combination parameter is not less than a first threshold value and the occupancy rate is not less than a second threshold value, judging that the parameter value of the first combination parameter meets a corresponding preset condition;
if the delay time in the second combination parameter is not less than a third threshold and the bandwidth is not greater than a fourth threshold, determining that the parameter value of the second combination parameter meets a corresponding preset condition;
and if the command issuing timeout time in the third combination parameter is greater than a fifth threshold, determining that the parameter value of the third combination parameter meets the corresponding preset condition.
Optionally, in the hard disk failure processing method, a mapping relationship exists between a hard disk parameter and a hard disk address of each hard disk in the storage system, and the mapping relationship is stored in a form of a mapping file, so as to determine the hard disk parameter of the target hard disk according to the mapping file.
Optionally, the determining the hard disk parameter of the target hard disk according to the hard disk address corresponding to the repair instruction includes:
and extracting a corresponding hard disk address from the repair instruction, and determining a hard disk parameter corresponding to the hard disk address according to the mapping relation stored in the mapping file.
Optionally, after the executing the bottom layer command to disconnect the physical layer port of the target hard disk, the method further includes:
and judging whether the target hard disk is removed from the storage system, and if so, recovering the physical layer port connection of the target hard disk.
Optionally, after generating the bottom layer command based on the hard disk parameter of the target hard disk, the method further includes:
and setting timeout time for the bottom layer command, and if the physical layer port connection of the target hard disk is not disconnected within the timeout time, forcibly executing an abort command on the target hard disk.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
In addition, the memory 22 serves as a carrier for resource storage and may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like. The resources stored on it may include an operating system 221, a computer program 222, data 223, and so on, and the storage may be transient or permanent.
The operating system 221 is used for managing and controlling each hardware device and the computer program 222 on the electronic device 20, so that the processor 21 can operate on and process the mass data 223 in the memory 22; it may be Windows Server, NetWare, Unix, Linux, or the like. In addition to the computer program that performs the hard disk failure processing method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs that perform other specific tasks. The data 223 may include repair instructions collected by the electronic device 20.
Further, an embodiment of the present application also discloses a storage medium, in which a computer program is stored, and when the computer program is loaded and executed by a processor, at least the following steps are implemented:
detecting the input and output states of each hard disk in the storage system, and determining a target hard disk with abnormal input and output states according to the detection result;
acquiring a repair instruction issued to the target hard disk by a small computer system interface driver, and determining a hard disk parameter of the target hard disk according to a hard disk address corresponding to the repair instruction;
and generating a bottom layer command based on the hard disk parameters of the target hard disk, and executing the bottom layer command to disconnect the physical layer port connection of the target hard disk.
Optionally, the detecting the input/output state of each hard disk in the storage system, and determining a target hard disk with an abnormal input/output state according to the detection result includes:
detecting parameter values of combination parameters reflecting input and output states of each hard disk in a storage system; the combination parameters are a first combination parameter containing service time and occupancy rate, a second combination parameter containing delay time and bandwidth, or a third combination parameter representing command issuance timeout time;
and if the parameter values of the combined parameters meet corresponding preset conditions, judging that the corresponding hard disk is the target hard disk with abnormal input and output states.
Optionally, if the parameter value of the combination parameter satisfies a corresponding preset condition, the method includes:
if the service time in the first combination parameter is not less than a first threshold value and the occupancy rate is not less than a second threshold value, judging that the parameter value of the first combination parameter meets a corresponding preset condition;
if the delay time in the second combination parameter is not less than a third threshold and the bandwidth is not more than a fourth threshold, determining that the parameter value of the second combination parameter meets the corresponding preset condition;
and if the command issuing timeout time in the third combination parameter is greater than a fifth threshold, determining that the parameter value of the third combination parameter meets a corresponding preset condition.
Optionally, in the hard disk failure processing method, a mapping relationship exists between a hard disk parameter of each hard disk in the storage system and a hard disk address, and the mapping relationship is stored in the form of a mapping file, so as to determine the hard disk parameter of the target hard disk according to the mapping file.
Optionally, the determining the hard disk parameter of the target hard disk according to the hard disk address corresponding to the repair instruction includes:
and extracting a corresponding hard disk address from the repair instruction, and determining a hard disk parameter corresponding to the hard disk address according to the mapping relation stored in the mapping file.
Optionally, after the executing the bottom layer command to disconnect the physical layer port of the target hard disk, the method further includes:
and judging whether the target hard disk is removed from the storage system, and if so, recovering the physical layer port connection of the target hard disk.
Optionally, after generating the bottom layer command based on the hard disk parameter of the target hard disk, the method further includes:
and setting timeout time for the bottom layer command, and if the physical layer port connection of the target hard disk is not disconnected within the timeout time, forcibly executing an abort command on the target hard disk.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The hard disk failure processing method, apparatus, device and storage medium provided by the present invention are described in detail above, and a specific example is applied in the present document to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A hard disk failure processing method is characterized by comprising the following steps:
detecting the input and output states of each hard disk in the storage system, and determining a target hard disk with abnormal input and output states according to the detection result;
acquiring a repair instruction issued to the target hard disk by a small computer system interface driver, and determining a hard disk parameter of the target hard disk according to a hard disk address corresponding to the repair instruction;
and generating a bottom layer command based on the hard disk parameters of the target hard disk, and executing the bottom layer command to disconnect the physical layer port connection of the target hard disk.
2. The method for processing hard disk failures according to claim 1, wherein the detecting the input/output states of each hard disk in the storage system and determining the target hard disk with abnormal input/output states according to the detection result comprises:
detecting parameter values of combination parameters reflecting input and output states of each hard disk in a storage system; the combination parameters are a first combination parameter containing service time and occupancy rate, a second combination parameter containing delay time and bandwidth, or a third combination parameter representing command issuing timeout time;
and if the parameter values of the combined parameters meet corresponding preset conditions, judging that the corresponding hard disk is the target hard disk with abnormal input and output states.
3. The method of claim 2, wherein if the parameter values of the combination parameters satisfy the corresponding preset conditions, the method comprises:
if the service time in the first combination parameter is not less than a first threshold value and the occupancy rate is not less than a second threshold value, determining that the parameter value of the first combination parameter meets a corresponding preset condition;
if the delay time in the second combination parameter is not less than a third threshold and the bandwidth is not greater than a fourth threshold, determining that the parameter value of the second combination parameter meets a corresponding preset condition;
and if the command issuing timeout time in the third combination parameter is greater than a fifth threshold, determining that the parameter value of the third combination parameter meets the corresponding preset condition.
4. The hard disk failure processing method according to claim 1, wherein a mapping relationship exists between hard disk parameters and hard disk addresses of each hard disk in the storage system, and the mapping relationship is stored in a form of a mapping file, so as to determine the hard disk parameters of the target hard disk according to the mapping file.
5. The method for processing hard disk failures according to claim 4, wherein the determining the hard disk parameters of the target hard disk according to the hard disk address corresponding to the repair instruction includes:
and extracting a corresponding hard disk address from the repair instruction, and determining a hard disk parameter corresponding to the hard disk address according to the mapping relation stored in the mapping file.
6. The method for processing hard disk failure according to any of claims 1 to 5, wherein after the executing the bottom layer command to disconnect the physical layer port of the target hard disk, the method further comprises:
and judging whether the target hard disk is removed from the storage system, and if so, recovering the physical layer port connection of the target hard disk.
7. The hard disk failure processing method according to any of claims 1 to 5, wherein after generating the bottom layer command based on the hard disk parameters of the target hard disk, the method further comprises:
and setting timeout time for the bottom layer command, and if the physical layer port connection of the target hard disk is not disconnected within the timeout time, forcibly executing an abort command on the target hard disk.
8. A hard disk failure processing apparatus, comprising:
the state detection module is used for detecting the input and output states of each hard disk in the storage system and determining a target hard disk with abnormal input and output states according to the detection result;
the parameter determining module is used for acquiring a repair instruction issued to the target hard disk by a small computer system interface driver and determining the hard disk parameters of the target hard disk according to the hard disk address corresponding to the repair instruction;
and the command generating and executing module is used for generating a bottom layer command based on the hard disk parameters of the target hard disk and executing the bottom layer command to disconnect the physical layer port connection of the target hard disk.
9. An electronic device, comprising a processor and a memory, wherein:
the memory is used for storing a computer program;
the computer program is loaded and executed by the processor to implement the hard disk failure handling method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing computer-executable instructions which, when loaded and executed by a processor, implement the hard disk failure handling method of any one of claims 1 to 7.
CN202211348897.8A 2022-10-31 2022-10-31 Hard disk fault processing method, device, equipment and storage medium Pending CN115793963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211348897.8A CN115793963A (en) 2022-10-31 2022-10-31 Hard disk fault processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211348897.8A CN115793963A (en) 2022-10-31 2022-10-31 Hard disk fault processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115793963A true CN115793963A (en) 2023-03-14

Family

ID=85434610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211348897.8A Pending CN115793963A (en) 2022-10-31 2022-10-31 Hard disk fault processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115793963A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117785074A (en) * 2024-02-28 2024-03-29 济南浪潮数据技术有限公司 Method, device, server and medium for processing input/output timeout
CN117806915A (en) * 2024-02-29 2024-04-02 苏州元脑智能科技有限公司 Method, device, computer equipment and storage medium for hard disk fault management
CN117806915B (en) * 2024-02-29 2024-05-24 苏州元脑智能科技有限公司 Method, device, computer equipment and storage medium for hard disk fault management

Similar Documents

Publication Publication Date Title
CN107179957B (en) Physical machine fault classification processing method and device and virtual machine recovery method and system
CN110677480B (en) Node health management method and device and computer readable storage medium
US20070168201A1 (en) Formula for automatic prioritization of the business impact based on a failure on a service in a loosely coupled application
CN109376029B (en) Processing method and processing system for SCSI hard disk abnormal overtime
CN110659159A (en) Service process operation monitoring method, device, equipment and storage medium
CN115793963A (en) Hard disk fault processing method, device, equipment and storage medium
CN105607973B (en) Method, device and system for processing equipment fault in virtual machine system
CN109062723A (en) The treating method and apparatus of server failure
CN113672415A (en) Disk fault processing method, device, equipment and storage medium
US20040073648A1 (en) Network calculator system and management device
JP3139548B2 (en) Error retry method, error retry system, and recording medium therefor
CN116755920B (en) Fault positioning method, device, apparatus, storage medium and electronic equipment
CN111478792B (en) Cutover information processing method, system and device
CN112015597B (en) Fault isolation method, device, equipment and computer readable storage medium
WO1999023562A1 (en) Automatic backup based on disk drive condition
CN115632706B (en) FC link management method, device, equipment and readable storage medium
CN116578459A (en) Slow disk monitoring and processing method, device and computer readable storage medium
CN111124785A (en) Hard disk fault checking method, device, equipment and storage medium
CN115470059A (en) Disk detection method, device, equipment and storage medium
CN110795276A (en) Storage medium repairing method, computer equipment and storage medium
CN114884836A (en) High-availability method, device and medium for virtual machine
CN111124729A (en) Fault disk determination method, device, equipment and computer readable storage medium
CN112905484A (en) Self-adaptive closed loop performance test method, system and medium
CN111625185A (en) Method, system and related assembly for monitoring disk fault
CN113656358A (en) Database log file processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination