CN113077835A - Hard disk fault detection method, device, equipment and readable storage medium - Google Patents

Publication number
CN113077835A
Authority
CN
China
Prior art keywords
preset
hard disk
index item
server node
numerical value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110380576.5A
Other languages
Chinese (zh)
Inventor
彭洁
刘谦
刘畅
屈大伟
李宇翔
陈龙辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Application filed by China Construction Bank Corp
Priority to CN202110380576.5A
Publication of CN113077835A

Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44Indication or identification of errors, e.g. for repair
    • G11C2029/4402Internal storage of test result, quality data, chip identification, repair information

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present application provide a fault detection method, device, equipment and readable storage medium for a hard disk. An operation log of each server node in a server cluster is obtained, and the occurrence frequency of each preset index item of each server node is determined according to the operation log. If the occurrence frequency of at least one preset index item of a server node is not less than the preset frequency threshold corresponding to that preset index item, the hard disk of the server node is taken as a target hard disk. The preset frequency threshold corresponding to each preset index item is configured according to the probability that the hard disk of the server node fails. The method then only needs to acquire the SMART log of the target hard disk and detect, according to that SMART log, whether the target hard disk has failed. Because the SMART log of the hard disk of every server node does not need to be obtained, the method avoids the delayed fault handling caused by the long time needed to acquire SMART logs, and it improves the efficiency of hard disk fault detection.

Description

Hard disk fault detection method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for detecting a failure of a hard disk.
Background
With the rapid development of computer technology, and in order to meet ever-growing requirements for computing on and storing massive data, the prior art continuously increases the capacity of a single hard disk and, by means of server cluster technology, continuously increases the number of hard disks mounted in a server cluster. In the actual operation and maintenance of a server cluster, index items are usually picked from the SMART (Self-Monitoring, Analysis and Reporting Technology) log of each hard disk for fault judgment and handling. However, as server clusters grow in scale, the volume of accesses made by service systems to the hard disks grows geometrically, and the hard disks are in a 100% busy working state most of the time, so the failure rate of the hard disks increases greatly. Because acquiring a SMART log takes a long time, a method that collects the SMART log of every hard disk for fault handling cannot handle faults in time, and the efficiency of hard disk fault detection needs to be improved.
Disclosure of Invention
The application provides a method, a device, equipment and a readable storage medium for detecting the fault of a hard disk, aiming at improving the efficiency of acquiring the fault information of the hard disk, and the method comprises the following steps:
a fault detection method of a hard disk comprises the following steps:
acquiring an operation log of each server node in a server cluster;
determining the occurrence frequency of a preset index item of each server node according to the running log of each server node;
if the occurrence frequency of at least one preset index item of the server node is not less than a preset frequency threshold corresponding to the preset index item, taking the hard disk of the server node as a target hard disk; the preset frequency threshold corresponding to the preset index item is configured according to the probability of the hard disk of the server node failing, and when the occurrence frequency of the preset index item is not less than the preset frequency threshold corresponding to the preset index item in the historical running log of the server node, the probability of the hard disk of the server node failing is greater than a first preset probability threshold;
acquiring a SMART log of the target hard disk;
and detecting whether the target hard disk fails or not according to the SMART log of the target hard disk.
Optionally, the preset index items include a strong correlation index item and a secondary correlation index item, wherein the preset frequency threshold corresponding to the strong correlation index item is equal to 1, and the preset frequency threshold corresponding to the secondary correlation index item is greater than 1.
Optionally, if the occurrence frequency of at least one of the preset index items of the server node is not less than a preset frequency threshold corresponding to the preset index item, taking the hard disk of the server node as a target hard disk includes:
if at least one item of the strongly-related index item exists in the running log, taking a hard disk of the server node as a target hard disk;
if the strongly relevant index items do not exist in the running log, judging whether the occurrence frequency of each secondary relevant index item is greater than a preset frequency threshold corresponding to the secondary relevant index item;
and if the occurrence frequency of at least one secondary correlation index item is greater than a preset frequency threshold corresponding to the secondary correlation index item, determining that the hard disk of the server node is a target hard disk.
Optionally, detecting whether the target hard disk fails according to the SMART log of the target hard disk includes:
determining the numerical value of a preset hard disk index item according to the SMART log of the target hard disk;
and if the numerical value of at least one preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, determining that the target hard disk fails, wherein the preset fault condition corresponding to the numerical value of each preset index item is configured according to the probability of the target hard disk failing, and in the historical SMART log of the target hard disk, when the numerical value of the preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, the probability of the target hard disk failing is greater than a second preset probability threshold.
Optionally, the preset hard disk index items include a first preset hard disk index item, a second preset hard disk index item, a third preset hard disk index item and a fourth preset hard disk index item;
the preset fault condition corresponding to the first preset hard disk index item is as follows: the numerical value of the first preset hard disk index item is not 0;
the preset fault condition corresponding to the second preset hard disk index item is as follows: the numerical value of the second preset hard disk index item is equal to a preset fault value;
the preset fault condition corresponding to the third preset hard disk index item is as follows: the numerical value of the third preset hard disk index item is not equal to a preset normal value;
the preset fault condition corresponding to the fourth preset hard disk index item is as follows: and the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
Optionally, if the value of at least one preset hard disk index item meets a preset fault condition corresponding to the preset index item, determining that the target hard disk fails includes:
if at least one of the following conditions is met, determining that the target hard disk fails: the numerical value of the first preset hard disk index item is not 0; the numerical value of the second preset hard disk index item is equal to a first preset numerical value; the numerical value of the third preset hard disk index item is not equal to a second preset numerical value; and the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
A failure detection apparatus of a hard disk, comprising:
the first log acquiring unit is used for acquiring the running logs of each server node in the server cluster;
the frequency acquisition unit is used for determining the occurrence frequency of a preset index item of each server node according to the running log of each server node;
a target hard disk determining unit, configured to, if the occurrence frequency of at least one preset index item of the server node is not less than a preset frequency threshold corresponding to the preset index item, use a hard disk of the server node as a target hard disk; the preset frequency threshold corresponding to the preset index item is configured according to the probability of the hard disk of the server node failing, and when the occurrence frequency of the preset index item is not less than the preset frequency threshold corresponding to the preset index item in the historical running log of the server node, the probability of the hard disk of the server node failing is greater than a first preset probability threshold;
a second log obtaining unit, configured to obtain a SMART log of the target hard disk;
and the fault determining unit is used for detecting whether the target hard disk has a fault or not according to the SMART log of the target hard disk.
Optionally, when detecting whether the target hard disk fails according to the SMART log of the target hard disk, the fault determining unit is specifically configured for:
determining the numerical value of a preset hard disk index item according to the SMART log of the target hard disk;
and if the numerical value of at least one preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, determining that the target hard disk fails, wherein the preset fault condition corresponding to the numerical value of each preset index item is configured according to the probability of the target hard disk failing, and in the historical SMART log of the target hard disk, when the numerical value of the preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, the probability of the target hard disk failing is greater than a second preset probability threshold.
A failure detection device of a hard disk, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program and realizing each step of the fault detection method of the hard disk.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method for fault detection of a hard disk.
According to the technical scheme, the method, the device, the equipment and the readable storage medium for detecting the fault of the hard disk, which are provided by the embodiment of the application, are used for obtaining the operation logs of each server node in the server cluster, determining the occurrence frequency of the preset index items of each server node according to the operation logs, and taking the hard disk of the server node as a target hard disk if the occurrence frequency of at least one preset index item of the server node is not less than the preset frequency threshold corresponding to the preset index item. In the historical operation log of the server node, when the occurrence frequency of the preset index item is not less than the preset frequency threshold corresponding to the preset index item, the probability of the hard disk of the server node failing is greater than the first preset probability threshold, so that the target hard disk is the hard disk with the probability of the failure greater than the first preset probability threshold. Therefore, the method only needs to acquire the SMART log of the target hard disk and detect whether the target hard disk fails or not according to the SMART log of the target hard disk. Because the time consumption for acquiring the operation logs of the server nodes is short and the efficiency is high, the method does not need to acquire the SMART log of the hard disk of each server node, the problem that the failure processing is not timely due to the fact that the time consumption for acquiring the SMART log is long is avoided, and the method also improves the efficiency of hard disk failure detection.
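The two-stage screening summarized above can be illustrated with a minimal Python sketch. It is not taken from the application: the function name `detect_failed_disks`, the injected callables, the node names, and the `reallocated` field are all hypothetical and chosen only for illustration. The sketch assumes that reading a running log is cheap and reading a SMART log is expensive, and it records which nodes actually had their SMART log fetched.

```python
from typing import Callable, Dict, Iterable, List


def detect_failed_disks(
    nodes: Iterable[str],
    fetch_run_log: Callable[[str], str],
    is_target: Callable[[str], bool],
    fetch_smart_log: Callable[[str], Dict[str, int]],
    is_failed: Callable[[Dict[str, int]], bool],
) -> List[str]:
    """Return the nodes whose hard disk is judged to have failed."""
    failed = []
    for node in nodes:
        run_log = fetch_run_log(node)      # collected for every node
        if not is_target(run_log):         # log-based screening
            continue
        smart = fetch_smart_log(node)      # collected only for target disks
        if is_failed(smart):
            failed.append(node)
    return failed


# Made-up data: three nodes, two of which show a suspicious log entry.
run_logs = {
    "node-a": "all good",
    "node-b": "I/O error on disk 0",
    "node-c": "I/O error on disk 1",
}
smart_logs = {"node-b": {"reallocated": 0}, "node-c": {"reallocated": 8}}
smart_calls = []


def fetch_smart(node: str) -> Dict[str, int]:
    smart_calls.append(node)               # track how often SMART is read
    return smart_logs[node]


result = detect_failed_disks(
    run_logs,
    run_logs.__getitem__,
    lambda log: "I/O error" in log,
    fetch_smart,
    lambda smart: smart["reallocated"] != 0,
)
```

In this made-up run the SMART log is read for two of the three nodes only, which is the saving the scheme relies on.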
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a specific implementation of a hard disk failure detection method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for detecting a failure of a hard disk according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a hard disk failure detection apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a fault detection device for a hard disk according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The hard disk fault detection method provided in the embodiments of the present application is applied to, but not limited to, obtaining fault information of hard disks deployed in a server cluster, with the aim of obtaining that information in time. The server cluster includes at least two server nodes; in practical applications the number of server nodes in a server cluster is huge, and in this embodiment N denotes the number of server nodes in the server cluster. It should be noted that at least one hard disk is deployed in each server node, that the type of the hard disk includes, but is not limited to, a solid state disk, and that the working mode of the hard disk includes, but is not limited to, a direct mode and a RAID (Redundant Array of Independent Disks) mode.
Fig. 1 is a schematic flow chart of a specific implementation manner of a hard disk failure detection method provided in an embodiment of the present application, which specifically includes:
S101, obtaining an operation log of each server node.
In this embodiment, the running log of any server node includes, but is not limited to, an operating system log of the server node within a preset time period.
The method for obtaining the operation log can be referred to in the prior art.
It should be noted that, in this embodiment, the method for determining whether the hard disk of each server node is the target hard disk is the same; S102 to S105 take the first server node as an example and describe the method for determining whether the hard disk of the server node is the target hard disk.
S102, searching the running log of the first server node for preset strong correlation index items.
In this embodiment, the strong correlation index item is preconfigured according to the historical operation log, and in the historical operation log, if the strong correlation index item occurs, the probability of the hard disk failure is greater than a first preset probability threshold. Strongly related index items include, but are not limited to: unrecovered read error, NCQ fail, I/O error, Device offline, SCSI error, and file system read-only.
It should be noted that each strongly correlated index item indicates that the running error corresponding to it occurred while the server node was running; generally, a strongly correlated index item is the preset script expression of that running error. The corresponding relationship between the strong correlation index item and the operation error can be referred to in the prior art; for example, the operation error corresponding to the strong correlation index item "unrecovered read error" is an "unrecoverable error". For the specific operation error corresponding to each strongly correlated index item, reference may be made to the prior art, and this embodiment does not describe it further.
S103, if any one of the strongly related index items appears, determining that the hard disk of the first server node is a target hard disk.
In this embodiment, the occurrence of at least one strong correlation index item is used as a first preset condition, and when the first preset condition is met, it is determined that the hard disk of the first server node is the target hard disk.
In this embodiment, if the running log of the first server node satisfies at least one of (11) existence of unrecovered read error, (12) existence of NCQ fail, (13) existence of I/O error, (14) existence of Device offline, (15) existence of SCSI error, and (16) existence of file system read-only, it is determined that the running log of the server node satisfies the first preset condition.
It should be noted that the configuration condition of the strongly related index item is determined according to the historical operating log, the strongly related index item may further include other multiple index items, and the strongly related index item is updated in real time according to the historical operating log, which is not described in this embodiment.
S104, determining the occurrence frequency of each preset secondary related index item according to the running log of the first server node.
In this embodiment, the method for determining the occurrence number of each preset secondary correlation index item may refer to the prior art, and optionally, the occurrence number of a keyword indicating the existence of the secondary correlation index item is searched for as the occurrence number of the existing secondary correlation index item.
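The optional keyword search described above could look like the following minimal Python sketch. The function name `count_index_items` and the sample log lines are invented for illustration, and case-insensitive substring matching is an assumption; the application only says that keyword occurrences are counted.

```python
import re
from typing import Dict, Iterable


def count_index_items(run_log: str, keywords: Iterable[str]) -> Dict[str, int]:
    """Count case-insensitive occurrences of each keyword in the log text."""
    counts = {}
    for keyword in keywords:
        # re.escape keeps characters such as "/" or "." literal.
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        counts[keyword] = len(pattern.findall(run_log))
    return counts


# Made-up sample text, not real operating system output.
sample_log = (
    "sample: Lun Reset on disk0\n"
    "sample: lun reset on disk0\n"
    "sample: Reset local on volume 1\n"
)
counts = count_index_items(sample_log, ["Lun Reset", "Reset local", "I/O error"])
```

The resulting dictionary maps each index item to its occurrence count, which can then be compared against the per-item threshold.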
In this embodiment, each secondary related index item is preconfigured according to the historical operation log. In the historical operation log, the more often a secondary related index item occurs, the greater the probability of the hard disk failing; when the occurrence frequency of the secondary related index item is greater than the preset frequency threshold corresponding to it, the probability of the hard disk failing is greater than the first preset probability threshold, where the preset frequency threshold is greater than 1.
Optionally, the secondary related index items include, but are not limited to: Lun Reset, Reset local, unregistered read error, and unregistered write error. It should be noted that each secondary related index item indicates that the operation error corresponding to it occurred while the server node was running; generally, a secondary related index item is the preset script expression of that operation error. For example, the operation error corresponding to the secondary related index item "Reset local" is a logical volume reset. For the specific operation error corresponding to each secondary related index item, reference may be made to the prior art, and this embodiment does not describe it further.
S105, if the occurrence frequency of at least one secondary related index item is greater than a preset frequency threshold corresponding to the secondary related index item, taking the hard disk of the first server node as a target hard disk.
In this embodiment, the condition that the occurrence frequency of at least one secondary related index item is greater than the preset frequency threshold corresponding to that item is used as a second preset condition, and when the second preset condition is met, it is determined that the hard disk of the first server node is the target hard disk. Specifically, if the running log of the first server node meets at least one of (11) the occurrence frequency of Lun Reset is greater than a preset first numerical value, (12) the occurrence frequency of Reset local is greater than a preset second numerical value, (13) the occurrence frequency of unregistered read error is greater than a preset third numerical value, and (14) the occurrence frequency of unregistered write error is greater than a preset fourth numerical value, it is determined that the running log of the server node meets the second preset condition.
It should be noted that the preset frequency threshold corresponding to each secondary related index item is determined according to the historical operation log, and the values of the first numerical value to the fourth numerical value may be the same or different. The secondary related index items may also include other index items, and the secondary related index items are updated in real time according to the historical operation log, which is not described in this embodiment. The execution sequence of S102 to S103 and S104 to S105 is not limited in the present application; that is, when the operation log of the first server node satisfies at least one of the first preset condition and the second preset condition, the hard disk of the server node is determined to be the target hard disk.
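A minimal Python sketch of the combined decision in S102 to S105 follows. It assumes the occurrence counts have already been computed; the function name `is_target_disk` and the threshold values 3 and 5 are hypothetical, since the application leaves the concrete values to be configured from historical logs.

```python
from typing import Mapping, Set


def is_target_disk(
    counts: Mapping[str, int],
    strong_items: Set[str],
    secondary_thresholds: Mapping[str, int],
) -> bool:
    """Return True if the first or the second preset condition is met."""
    # First preset condition: any strongly related item occurs at least once.
    if any(counts.get(item, 0) >= 1 for item in strong_items):
        return True
    # Second preset condition: any secondary item occurs more often than
    # its own threshold (each threshold is assumed to be greater than 1).
    return any(
        counts.get(item, 0) > limit
        for item, limit in secondary_thresholds.items()
    )


STRONG = {"I/O error", "NCQ fail"}
SECONDARY = {"Lun Reset": 3, "Reset local": 5}   # illustrative values only

by_strong = is_target_disk({"NCQ fail": 1}, STRONG, SECONDARY)
at_limit = is_target_disk({"Lun Reset": 3}, STRONG, SECONDARY)
over_limit = is_target_disk({"Lun Reset": 4}, STRONG, SECONDARY)
clean = is_target_disk({}, STRONG, SECONDARY)
```

Because the two conditions are combined with a logical OR, the order in which they are evaluated does not change the result, which matches the statement that the execution sequence is not limited.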
As can be seen from the above, in the historical operation log, if a strong related index item occurs, the probability of the hard disk failing is greater than the first preset probability threshold; therefore, whenever at least one strong related index item exists in the operation log of a server, the probability of the hard disk of that server failing is greater than the first preset probability threshold. Likewise, in the historical operation log, when the occurrence frequency of a secondary related index item is greater than the preset frequency threshold corresponding to it, the probability of the hard disk failing is greater than the first preset probability threshold; therefore, when the occurrence frequency of at least one secondary related index item in the operation log of a server is greater than the corresponding preset frequency threshold, the probability of the hard disk of that server failing is greater than the first preset probability threshold. Consequently, when the operation log of the first server node meets at least one of the first preset condition and the second preset condition, the hard disk of the server node is determined to be the target hard disk, and the target hard disk determined in this way is a hard disk whose failure probability, as determined from the operation log, is greater than the first preset probability threshold.
In this embodiment, S102 to S105 determine whether the hard disk of each server node is a target hard disk, so as to obtain at least one target hard disk, and S106 to S109 illustrate a specific implementation process for determining whether the target hard disk fails.
S106, obtaining a numerical value of a first preset hard disk index item in the SMART log of the target hard disk, and determining that the target hard disk fails if the numerical value of the first preset hard disk index item is not 0.
In this embodiment, the first preset hard disk index item is an index item pre-configured according to the historical running log, and the configuration condition is: when the numerical value of the first preset hard disk index item is not 0, the probability of the target hard disk failing is greater than a second preset probability threshold.
S107, obtaining the numerical value of a second preset hard disk index item in the SMART log of the target hard disk, and determining that the target hard disk fails if the numerical value of the second preset hard disk index item is equal to the preset fault value of the second preset hard disk index item.
In this embodiment, the second preset hard disk index item is an index item pre-configured according to the historical running log, and the configuration condition is: and when the numerical value of the second preset hard disk index item is equal to the preset fault value of the second preset hard disk index item, the probability of the fault of the target hard disk is greater than a second preset probability threshold value.
S108, obtaining a numerical value of a third preset hard disk index item in the SMART log of the target hard disk, and if the numerical value of the third preset hard disk index item is not equal to a preset normal value of the third preset hard disk index item, determining that the target hard disk fails.
In this embodiment, the third preset hard disk index item is an index item pre-configured according to the historical running log, and the configuration condition is: and when the numerical value of the third preset hard disk index item is not equal to the preset normal value of the third preset hard disk index item, the probability of the target hard disk failing is greater than a second preset probability threshold.
S109, obtaining a numerical value of a fourth preset hard disk index item in the SMART log of the target hard disk, and determining that the target hard disk fails if the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold corresponding to the fourth preset hard disk index item.
In this embodiment, the fourth preset hard disk index item is an index item configured in advance according to the historical running log, and the configuration condition is: when the numerical value of the fourth preset hard disk index item exceeds the preset numerical value threshold corresponding to the fourth preset hard disk index item, the probability of the target hard disk failing is greater than a second preset probability threshold.
It should be noted that the four types of hard disk index items are all index items which are configured in advance through a history running log and are related to whether a target hard disk has a fault, and table 1 illustrates a corresponding relationship between specific index item contents of the four types of hard disk index items and a preset fault condition in practical application.
TABLE 1 Correspondence between hard disk index items and preset fault conditions (provided as images in the original publication)
The execution sequence of S106 to S109 is not limited in the present application, and the process ends as soon as any one of S106 to S109 determines that the target hard disk has failed.
According to the above technical scheme, the first to fourth preset hard disk index items are pre-configured according to the historical running log, and the configuration condition is that, when the numerical value of a preset hard disk index item meets the preset fault condition corresponding to it, the probability of the target hard disk failing is greater than the second preset probability threshold. Therefore, when the SMART log of the target hard disk shows that the numerical value of at least one preset hard disk index item meets its corresponding preset fault condition, the method determines with high accuracy that the target hard disk has failed.
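The four kinds of preset fault condition used in S106 to S109 can be sketched in Python as follows. This is an illustration under assumptions: the item names `item_1` to `item_4` and the reference values are placeholders, because the concrete index items of Table 1 are not reproduced in this text, and the SMART log is assumed to be already parsed into a name-to-value mapping.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Kind(Enum):
    NONZERO = 1        # first type: value is not 0
    EQUALS_FAULT = 2   # second type: value equals a preset fault value
    NOT_NORMAL = 3     # third type: value differs from a preset normal value
    EXCEEDS = 4        # fourth type: value exceeds a preset threshold


@dataclass(frozen=True)
class FaultCondition:
    item: str
    kind: Kind
    ref: int = 0       # fault value, normal value, or threshold (unused for NONZERO)

    def met(self, value: int) -> bool:
        if self.kind is Kind.NONZERO:
            return value != 0
        if self.kind is Kind.EQUALS_FAULT:
            return value == self.ref
        if self.kind is Kind.NOT_NORMAL:
            return value != self.ref
        return value > self.ref


def disk_failed(smart: Dict[str, int], conditions: List[FaultCondition]) -> bool:
    """True as soon as one condition is met; any() stops at the first hit."""
    return any(c.item in smart and c.met(smart[c.item]) for c in conditions)


CONDITIONS = [
    FaultCondition("item_1", Kind.NONZERO),
    FaultCondition("item_2", Kind.EQUALS_FAULT, 1),
    FaultCondition("item_3", Kind.NOT_NORMAL, 100),
    FaultCondition("item_4", Kind.EXCEEDS, 60),
]

healthy = {"item_1": 0, "item_2": 0, "item_3": 100, "item_4": 45}
over_threshold = dict(healthy, item_4=61)
abnormal = dict(healthy, item_3=99)
```

Keeping the conditions in a list means the evaluation order can be changed freely without affecting the verdict.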
It should be noted that, the specific configuration conditions of each preset hard disk index item are referred to the above embodiments.
The flow shown in fig. 1 is only a specific implementation manner of the hard disk failure detection method provided by the present application, and the present application also includes other specific implementation manners.
For example, the strong correlation index item and the secondary correlation index item are preset index items determined according to the historical operation log, and the specific preset index items are not limited to the index items shown in table 1 and table 2 in the above embodiment, and may also include other index items.
For another example, S106 to S109 are specific implementation methods for determining whether the target hard disk fails according to the SMART log of the target hard disk, and the present application also includes other implementation methods, which may specifically refer to the prior art.
In summary, in this embodiment, the hard disk failure detection method provided by the present application can be summarized as the flow shown in fig. 2. Specifically, the method may include:
S201, obtaining the running logs of each server node in the server cluster.
In this embodiment, the server cluster includes a plurality of server nodes, and at least one hard disk runs on each server node. The running log of each server node can be obtained by extracting the running log within a preset time period from the log storage space of the server; for details, reference may be made to the prior art.
S202, determining the occurrence frequency of the preset index items of each server node according to the running log.
It should be noted that there are multiple methods for determining the occurrence frequency of the preset index items; for details, reference may be made to the above embodiments.
S203, if the occurrence frequency of at least one preset index item of the server node is not less than a preset frequency threshold corresponding to the preset index item, taking the hard disk of the server node as a target hard disk.
In this embodiment, the preset frequency threshold corresponding to each preset index item is determined according to the probability of the hard disk of the server node failing: in the historical running log of the server node, when the occurrence frequency of the preset index item is not less than the preset frequency threshold corresponding to the preset index item, the probability of the hard disk of the server node failing is greater than a first preset probability threshold.
It should be noted that, the configuration method of the preset index item and the specific method for determining the preset number threshold corresponding to the preset index item may be referred to in the prior art.
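As a minimal sketch of S202 and S203, the following Python example counts how often each preset index item appears in a node's running log and compares the counts with the corresponding preset frequency thresholds, using the "not less than" comparison stated in this embodiment. The index items are modeled as log substrings; the item strings, threshold values and function names are hypothetical and are not taken from the application.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical preset index items (log substrings) and their frequency thresholds.
FREQUENCY_THRESHOLDS: Dict[str, int] = {
    "I/O error": 1,
    "task blocked for more than": 3,
}


def count_index_items(log_lines: Iterable[str], items: Iterable[str]) -> Counter:
    """S202: count how many log lines contain each preset index item."""
    items = list(items)
    counts = Counter({item: 0 for item in items})
    for line in log_lines:
        for item in items:
            if item in line:
                counts[item] += 1
    return counts


def is_target_node(counts: Counter, thresholds: Dict[str, int]) -> bool:
    """S203: the node's hard disk is a target if any count is not less than its threshold."""
    return any(counts[item] >= limit for item, limit in thresholds.items())


# Illustrative running log of one server node.
running_log = [
    "kernel: I/O error, dev sda, sector 1024",
    "INFO: task blocked for more than 120 seconds",
    "service started",
]
counts = count_index_items(running_log, FREQUENCY_THRESHOLDS)
node_is_target = is_target_node(counts, FREQUENCY_THRESHOLDS)
```

Substring matching is only one of the possible counting methods; a real deployment might instead parse structured log fields.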
S204, acquiring a SMART log of the target hard disk.
S205, detecting whether the target hard disk fails or not according to the SMART log of the target hard disk.
According to the above technical scheme, the method obtains the running logs of each server node in the server cluster, determines the occurrence frequency of the preset index items of each server node according to the running logs, and takes the hard disk of a server node as the target hard disk if the occurrence frequency of at least one preset index item of that server node is not less than the preset frequency threshold corresponding to the preset index item. In the historical running log of the server node, when the occurrence frequency of a preset index item is not less than its corresponding preset frequency threshold, the probability of the hard disk of the server node failing is greater than the first preset probability threshold; the target hard disk is therefore a hard disk whose probability of failing is greater than the first preset probability threshold. Consequently, the method only needs to acquire the SMART log of the target hard disk and detect, according to that SMART log, whether the target hard disk has failed. Because acquiring the running logs of the server nodes takes little time and is efficient, and the method does not need to acquire the SMART log of the hard disk of every server node, the problem of untimely fault handling caused by the long time needed to acquire SMART logs is avoided, and the efficiency of hard disk fault detection is improved.
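The two-stage flow S201 to S205 can be sketched in Python as follows. This is an illustrative outline only: how logs are actually fetched is left to injected callables, and every function name, parameter and sample value here is an assumption rather than an interface defined by the application. The point of the sketch is that the SMART log is read only for nodes that pass the cheap running-log screening.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def detect_failed_disks(
    nodes: Iterable[str],
    fetch_running_log: Callable[[str], List[str]],          # S201 (assumed interface)
    is_target: Callable[[List[str]], bool],                 # S202-S203
    list_disks: Callable[[str], List[str]],                 # disks on a node (assumed)
    fetch_smart_log: Callable[[str, str], Dict[str, int]],  # S204 (assumed interface)
    has_failed: Callable[[Dict[str, int]], bool],           # S205
) -> List[Tuple[str, str]]:
    """Return (node, disk) pairs judged failed; SMART logs are read only for target disks."""
    failed = []
    for node in nodes:
        if not is_target(fetch_running_log(node)):
            continue  # screening failed: skip the slow SMART read for this node
        for disk in list_disks(node):
            if has_failed(fetch_smart_log(node, disk)):
                failed.append((node, disk))
    return failed


# Stub data standing in for a real cluster (all values are made up).
logs = {"node-a": ["I/O error on sda"], "node-b": ["all good"]}
smart = {
    ("node-a", "sda"): {"pending_sectors": 4},
    ("node-b", "sda"): {"pending_sectors": 0},
}
smart_reads = []  # records which SMART logs were actually fetched


def read_smart(node: str, disk: str) -> Dict[str, int]:
    smart_reads.append((node, disk))
    return smart[(node, disk)]


result = detect_failed_disks(
    nodes=logs,
    fetch_running_log=logs.__getitem__,
    is_target=lambda lines: any("I/O error" in line for line in lines),
    list_disks=lambda node: ["sda"],
    fetch_smart_log=read_smart,
    has_failed=lambda values: values["pending_sectors"] != 0,
)
```

With the stub data, only the SMART log of `node-a` is fetched, which mirrors the efficiency argument made in the paragraph above.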
Fig. 3 is a schematic structural diagram of a hard disk failure detection apparatus according to an embodiment of the present application, and as shown in fig. 3, the apparatus may include:
a first log obtaining unit 301, configured to obtain an operation log of each server node in a server cluster;
a frequency obtaining unit 302, configured to determine, according to the running log of each server node, a frequency of occurrence of a preset index item of each server node;
a target hard disk determining unit 303, configured to, if the occurrence frequency of at least one preset index item of the server node is not less than a preset frequency threshold corresponding to the preset index item, use a hard disk of the server node as a target hard disk; the preset frequency threshold corresponding to the preset index item is configured according to the probability of the hard disk of the server node failing, and when the occurrence frequency of the preset index item is not less than the preset frequency threshold corresponding to the preset index item in the historical running log of the server node, the probability of the hard disk of the server node failing is greater than a first preset probability threshold;
a second log obtaining unit 304, configured to obtain a SMART log of the target hard disk;
a failure determining unit 305, configured to detect whether the target hard disk fails according to the SMART log of the target hard disk.
Optionally, the preset index items include a strong correlation index item and a secondary correlation index item, wherein the preset frequency threshold corresponding to the strong correlation index item is equal to 1, and the preset frequency threshold corresponding to the secondary correlation index item is greater than 1.
Optionally, the target hard disk determining unit being configured to take the hard disk of the server node as the target hard disk if the occurrence frequency of at least one preset index item of the server node is not less than the preset frequency threshold corresponding to the preset index item includes: the target hard disk determining unit is specifically configured to:
if at least one item of the strongly-related index item exists in the running log, taking a hard disk of the server node as a target hard disk;
if the strongly relevant index items do not exist in the running log, judging whether the occurrence frequency of each secondary relevant index item is greater than a preset frequency threshold corresponding to the secondary relevant index item;
and if the occurrence frequency of at least one secondary correlation index item is greater than a preset frequency threshold corresponding to the secondary correlation index item, determining that the hard disk of the server node is a target hard disk.
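The decision order in the list above (strong correlation index items first, secondary correlation index items only if no strong item is present) can be sketched in Python as follows. The function name, the item names and the threshold values are hypothetical; the strict "greater than" comparison for secondary items follows the wording of this passage.

```python
from typing import Dict, Mapping, Set


def is_target_hard_disk(
    counts: Mapping[str, int],
    strong_items: Set[str],
    secondary_thresholds: Dict[str, int],
) -> bool:
    """Decide whether a node's hard disk is a target from index item counts."""
    # Strong correlation index items: a single occurrence in the log is enough.
    if any(counts.get(item, 0) >= 1 for item in strong_items):
        return True
    # No strong item present: compare each secondary item with its own threshold.
    return any(
        counts.get(item, 0) > limit for item, limit in secondary_thresholds.items()
    )


# Hypothetical item names and thresholds, for illustration only.
STRONG = {"medium_error"}
SECONDARY = {"slow_io": 5}

strong_hit = is_target_hard_disk({"medium_error": 1}, STRONG, SECONDARY)
at_threshold = is_target_hard_disk({"slow_io": 5}, STRONG, SECONDARY)
above_threshold = is_target_hard_disk({"slow_io": 6}, STRONG, SECONDARY)
```

Checking strong items first lets the function return early without evaluating any secondary thresholds.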
Optionally, the fault determining unit being configured to detect whether the target hard disk fails according to the SMART log of the target hard disk includes: the fault determining unit is specifically configured to:
determining the numerical value of a preset hard disk index item according to the SMART log of the target hard disk;
and if the numerical value of at least one preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, determining that the target hard disk fails, wherein the preset fault condition corresponding to the numerical value of each preset index item is configured according to the probability of the target hard disk failing, and in the historical SMART log of the target hard disk, when the numerical value of the preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, the probability of the target hard disk failing is greater than a second preset probability threshold.
Optionally, the preset hard disk index items include a first preset hard disk index item, a second preset hard disk index item, a third preset hard disk index item and a fourth preset hard disk index item;
the preset fault condition corresponding to the first preset hard disk index item is as follows: the numerical value of the first preset hard disk index item is not 0;
the preset fault condition corresponding to the second preset hard disk index item is as follows: the numerical value of the second preset hard disk index item is equal to a preset fault value;
the preset fault condition corresponding to the third preset hard disk index item is as follows: the numerical value of the third preset hard disk index item is not equal to a preset normal value;
the preset fault condition corresponding to the fourth preset hard disk index item is as follows: and the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
Optionally, the fault determining unit being configured to determine that the target hard disk fails if the numerical value of at least one preset hard disk index item satisfies the preset fault condition corresponding to the preset hard disk index item includes: the fault determining unit is specifically configured to:
if at least one of the following conditions is met, determine that the target hard disk has a fault: the numerical value of the first preset hard disk index item is not 0; the numerical value of the second preset hard disk index item is equal to a first preset numerical value; the numerical value of the third preset hard disk index item is not equal to a second preset numerical value; the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
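The four preset fault conditions can be written as one short-circuiting check. The Python sketch below is a minimal illustration: the keys `first` to `fourth` are placeholders for the four preset hard disk index items (the concrete SMART attributes of Table 1 are not reproduced in this text), and the class name and preset values are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FaultPresets:
    fault_value: int      # second item: fault if the value equals this
    normal_value: int     # third item: fault if the value differs from this
    value_threshold: int  # fourth item: fault if the value exceeds this


def disk_has_failed(values: Dict[str, int], presets: FaultPresets) -> bool:
    """Return True if at least one of the four preset fault conditions is met."""
    checks = (
        lambda: values["first"] != 0,
        lambda: values["second"] == presets.fault_value,
        lambda: values["third"] != presets.normal_value,
        lambda: values["fourth"] > presets.value_threshold,
    )
    # any() stops at the first condition that holds, so later checks are skipped.
    return any(check() for check in checks)


# Made-up preset values and SMART readings, for illustration only.
presets = FaultPresets(fault_value=1, normal_value=100, value_threshold=50)
healthy = {"first": 0, "second": 0, "third": 100, "fourth": 10}
over_threshold = {"first": 0, "second": 0, "third": 100, "fourth": 51}
```

Because the conditions are independent, the order of the entries in `checks` can be changed without affecting the result.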
Fig. 4 shows a schematic structural diagram of a fault detection device of the hard disk, which may include: at least one processor 401, at least one communication interface 402, at least one memory 403 and at least one communication bus 404;
in the embodiment of the present application, the number of the processor 401, the communication interface 402, the memory 403 and the communication bus 404 is at least one, and the processor 401, the communication interface 402 and the memory 403 complete communication with each other through the communication bus 404;
the processor 401 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, or the like;
the memory 403 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;
the memory stores a program, and the processor can execute the program stored in the memory, so as to implement the steps of the hard disk fault detection method provided by the embodiment of the application, as follows:
a fault detection method of a hard disk comprises the following steps:
acquiring an operation log of each server node in a server cluster;
determining the occurrence frequency of a preset index item of each server node according to the running log of each server node;
if the occurrence frequency of at least one preset index item of the server node is not less than a preset frequency threshold corresponding to the preset index item, taking the hard disk of the server node as a target hard disk; the preset frequency threshold corresponding to the preset index item is configured according to the probability of the hard disk of the server node failing, and when the occurrence frequency of the preset index item is not less than the preset frequency threshold corresponding to the preset index item in the historical running log of the server node, the probability of the hard disk of the server node failing is greater than a first preset probability threshold;
acquiring a SMART log of the target hard disk;
and detecting whether the target hard disk fails or not according to the SMART log of the target hard disk.
Optionally, the preset index items include a strong correlation index item and a secondary correlation index item, wherein the preset frequency threshold corresponding to the strong correlation index item is equal to 1, and the preset frequency threshold corresponding to the secondary correlation index item is greater than 1.
Optionally, if the occurrence frequency of at least one of the preset index items of the server node is greater than a preset frequency threshold corresponding to the preset index item, taking the hard disk of the server node as a target hard disk includes:
if at least one item of the strongly-related index item exists in the running log, taking a hard disk of the server node as a target hard disk;
if the strongly relevant index items do not exist in the running log, judging whether the occurrence frequency of each secondary relevant index item is greater than a preset frequency threshold corresponding to the secondary relevant index item;
and if the occurrence frequency of at least one secondary correlation index item is greater than a preset frequency threshold corresponding to the secondary correlation index item, determining that the hard disk of the server node is a target hard disk.
Optionally, detecting whether the target hard disk fails according to the SMART log of the target hard disk includes:
determining the numerical value of a preset hard disk index item according to the SMART log of the target hard disk;
and if the numerical value of at least one preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, determining that the target hard disk fails, wherein the preset fault condition corresponding to the numerical value of each preset index item is configured according to the probability of the target hard disk failing, and in the historical SMART log of the target hard disk, when the numerical value of the preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, the probability of the target hard disk failing is greater than a second preset probability threshold.
Optionally, the preset hard disk index items include a first preset hard disk index item, a second preset hard disk index item, a third preset hard disk index item and a fourth preset hard disk index item;
the preset fault condition corresponding to the first preset hard disk index item is as follows: the numerical value of the first preset hard disk index item is not 0;
the preset fault condition corresponding to the second preset hard disk index item is as follows: the numerical value of the second preset hard disk index item is equal to a preset fault value;
the preset fault condition corresponding to the third preset hard disk index item is as follows: the numerical value of the third preset hard disk index item is not equal to a preset normal value;
the preset fault condition corresponding to the fourth preset hard disk index item is as follows: and the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
Optionally, if the value of at least one preset hard disk index item meets a preset fault condition corresponding to the preset index item, determining that the target hard disk fails includes:
if at least one of the following conditions is met, determining that the target hard disk has a fault: the numerical value of the first preset hard disk index item is not 0; the numerical value of the second preset hard disk index item is equal to a first preset numerical value; the numerical value of the third preset hard disk index item is not equal to a second preset numerical value; the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
An embodiment of the present application further provides a readable storage medium, where the readable storage medium may store a computer program suitable for being executed by a processor, and when the computer program is executed by the processor, the steps of the hard disk failure detection method provided in the embodiment of the present application are implemented as follows:
a fault detection method of a hard disk comprises the following steps:
acquiring an operation log of each server node in a server cluster;
determining the occurrence frequency of a preset index item of each server node according to the running log of each server node;
if the occurrence frequency of at least one preset index item of the server node is not less than a preset frequency threshold corresponding to the preset index item, taking the hard disk of the server node as a target hard disk; the preset frequency threshold corresponding to the preset index item is configured according to the probability of the hard disk of the server node failing, and when the occurrence frequency of the preset index item is not less than the preset frequency threshold corresponding to the preset index item in the historical running log of the server node, the probability of the hard disk of the server node failing is greater than a first preset probability threshold;
acquiring a SMART log of the target hard disk;
and detecting whether the target hard disk fails or not according to the SMART log of the target hard disk.
Optionally, the preset index items include a strong correlation index item and a secondary correlation index item, wherein the preset frequency threshold corresponding to the strong correlation index item is equal to 1, and the preset frequency threshold corresponding to the secondary correlation index item is greater than 1.
Optionally, if the occurrence frequency of at least one of the preset index items of the server node is greater than a preset frequency threshold corresponding to the preset index item, taking the hard disk of the server node as a target hard disk includes:
if at least one item of the strongly-related index item exists in the running log, taking a hard disk of the server node as a target hard disk;
if the strongly relevant index items do not exist in the running log, judging whether the occurrence frequency of each secondary relevant index item is greater than a preset frequency threshold corresponding to the secondary relevant index item;
and if the occurrence frequency of at least one secondary correlation index item is greater than a preset frequency threshold corresponding to the secondary correlation index item, determining that the hard disk of the server node is a target hard disk.
Optionally, detecting whether the target hard disk fails according to the SMART log of the target hard disk includes:
determining the numerical value of a preset hard disk index item according to the SMART log of the target hard disk;
and if the numerical value of at least one preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, determining that the target hard disk fails, wherein the preset fault condition corresponding to the numerical value of each preset index item is configured according to the probability of the target hard disk failing, and in the historical SMART log of the target hard disk, when the numerical value of the preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, the probability of the target hard disk failing is greater than a second preset probability threshold.
Optionally, the preset hard disk index items include a first preset hard disk index item, a second preset hard disk index item, a third preset hard disk index item and a fourth preset hard disk index item;
the preset fault condition corresponding to the first preset hard disk index item is as follows: the numerical value of the first preset hard disk index item is not 0;
the preset fault condition corresponding to the second preset hard disk index item is as follows: the numerical value of the second preset hard disk index item is equal to a preset fault value;
the preset fault condition corresponding to the third preset hard disk index item is as follows: the numerical value of the third preset hard disk index item is not equal to a preset normal value;
the preset fault condition corresponding to the fourth preset hard disk index item is as follows: and the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
Optionally, if the value of at least one preset hard disk index item meets a preset fault condition corresponding to the preset index item, determining that the target hard disk fails includes:
if at least one of the following conditions is met, determining that the target hard disk has a fault: the numerical value of the first preset hard disk index item is not 0; the numerical value of the second preset hard disk index item is equal to a first preset numerical value; the numerical value of the third preset hard disk index item is not equal to a second preset numerical value; the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A fault detection method of a hard disk is characterized by comprising the following steps:
acquiring an operation log of each server node in a server cluster;
determining the occurrence frequency of a preset index item of each server node according to the running log of each server node;
if the occurrence frequency of at least one preset index item of the server node is not less than a preset frequency threshold corresponding to the preset index item, taking the hard disk of the server node as a target hard disk; the preset frequency threshold corresponding to the preset index item is configured according to the probability of the hard disk of the server node failing, and when the occurrence frequency of the preset index item is not less than the preset frequency threshold corresponding to the preset index item in the historical running log of the server node, the probability of the hard disk of the server node failing is greater than a first preset probability threshold;
acquiring a SMART log of the target hard disk;
and detecting whether the target hard disk fails or not according to the SMART log of the target hard disk.
2. The method of claim 1, wherein the preset index items comprise a strong correlation index item and a secondary correlation index item, wherein the preset frequency threshold corresponding to the strong correlation index item is equal to 1, and the preset frequency threshold corresponding to the secondary correlation index item is greater than 1.
3. The method according to claim 2, wherein if the occurrence frequency of at least one of the preset index items of the server node is greater than a preset frequency threshold corresponding to the preset index item, taking the hard disk of the server node as a target hard disk includes:
if at least one item of the strongly-related index item exists in the running log, taking a hard disk of the server node as a target hard disk;
if the strongly relevant index items do not exist in the running log, judging whether the occurrence frequency of each secondary relevant index item is greater than a preset frequency threshold corresponding to the secondary relevant index item;
and if the occurrence frequency of at least one secondary correlation index item is greater than a preset frequency threshold corresponding to the secondary correlation index item, determining that the hard disk of the server node is a target hard disk.
4. The method according to claim 1 or 3, wherein the detecting whether the target hard disk fails according to the SMART log of the target hard disk comprises:
determining the numerical value of a preset hard disk index item according to the SMART log of the target hard disk;
and if the numerical value of at least one preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, determining that the target hard disk fails, wherein the preset fault condition corresponding to the numerical value of each preset index item is configured according to the probability of the target hard disk failing, and in the historical SMART log of the target hard disk, when the numerical value of the preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, the probability of the target hard disk failing is greater than a second preset probability threshold.
5. The method of claim 4, wherein the preset hard disk index items comprise a first preset hard disk index item, a second preset hard disk index item, a third preset hard disk index item, and a fourth preset hard disk index item;
the preset fault condition corresponding to the first preset hard disk index item is as follows: the numerical value of the first preset hard disk index item is not 0;
the preset fault condition corresponding to the second preset hard disk index item is as follows: the numerical value of the second preset hard disk index item is equal to a preset fault value;
the preset fault condition corresponding to the third preset hard disk index item is as follows: the numerical value of the third preset hard disk index item is not equal to a preset normal value;
the preset fault condition corresponding to the fourth preset hard disk index item is as follows: and the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
6. The method according to claim 5, wherein the determining that the target hard disk fails if the value of the at least one preset hard disk index item satisfies a preset failure condition corresponding to the preset index item comprises:
if at least one of the following conditions is met, determining that the target hard disk has a fault: the numerical value of the first preset hard disk index item is not 0; the numerical value of the second preset hard disk index item is equal to a first preset numerical value; the numerical value of the third preset hard disk index item is not equal to a second preset numerical value; the numerical value of the fourth preset hard disk index item exceeds a preset numerical value threshold.
7. A failure detection apparatus for a hard disk, comprising:
the first log acquiring unit is used for acquiring the running logs of each server node in the server cluster;
the frequency acquisition unit is used for determining the occurrence frequency of a preset index item of each server node according to the running log of each server node;
a target hard disk determining unit, configured to, if the occurrence frequency of at least one preset index item of the server node is not less than a preset frequency threshold corresponding to the preset index item, use a hard disk of the server node as a target hard disk; the preset frequency threshold corresponding to the preset index item is configured according to the probability of the hard disk of the server node failing, and when the occurrence frequency of the preset index item is not less than the preset frequency threshold corresponding to the preset index item in the historical running log of the server node, the probability of the hard disk of the server node failing is greater than a first preset probability threshold;
a second log obtaining unit, configured to obtain a SMART log of the target hard disk;
and the fault determining unit is used for detecting whether the target hard disk has a fault or not according to the SMART log of the target hard disk.
8. The apparatus of claim 7, wherein the failure determination unit is configured to detect whether the target hard disk fails according to a SMART log of the target hard disk, and includes: the fault determination unit is specifically configured to:
determining the numerical value of a preset hard disk index item according to the SMART log of the target hard disk;
and if the numerical value of at least one preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, determining that the target hard disk fails, wherein the preset fault condition corresponding to the numerical value of each preset index item is configured according to the probability of the target hard disk failing, and in the historical SMART log of the target hard disk, when the numerical value of the preset hard disk index item meets the preset fault condition corresponding to the preset hard disk index item, the probability of the target hard disk failing is greater than a second preset probability threshold.
9. A failure detection apparatus of a hard disk, characterized by comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the method for detecting a failure in a hard disk according to any one of claims 1 to 6.
10. A readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the steps of the method for detecting a failure of a hard disk according to any one of claims 1 to 6.
CN202110380576.5A 2021-04-09 2021-04-09 Hard disk fault detection method, device, equipment and readable storage medium Pending CN113077835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110380576.5A CN113077835A (en) 2021-04-09 2021-04-09 Hard disk fault detection method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110380576.5A CN113077835A (en) 2021-04-09 2021-04-09 Hard disk fault detection method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113077835A true CN113077835A (en) 2021-07-06

Family

ID=76615674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110380576.5A Pending CN113077835A (en) 2021-04-09 2021-04-09 Hard disk fault detection method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113077835A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination