CN112732517B - Disk fault alarm method, device, equipment and readable storage medium - Google Patents

Info

Publication number
CN112732517B
Authority
CN
China
Prior art keywords
disk
feature
fault
data
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011601555.3A
Other languages
Chinese (zh)
Other versions
CN112732517A (en)
Inventor
张中文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Inspur Data Technology Co Ltd
Original Assignee
Beijing Inspur Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Inspur Data Technology Co Ltd
Priority to CN202011601555.3A
Publication of CN112732517A
Application granted
Publication of CN112732517B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a disk fault alarm method that collects the characteristic data and health status of the disks belonging to a storage system cluster. Because the disks in a storage system run under similar conditions, when one disk fails the other disks are, with high probability, close to a fault critical point. Fault prediction for the other disks is therefore based on the disk characteristics of the failed disk: if a currently running disk matches those characteristics, it is judged to have a higher risk of failure. Technicians can then investigate and maintain the fault critical disk in time, which avoids the impact on system stability that would follow its failure and reduces the operation and maintenance burden. The invention also discloses a disk fault alarm device, equipment and a readable storage medium, which have corresponding technical effects.

Description

Disk fault alarm method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of computer applications, and in particular, to a method, apparatus, device, and readable storage medium for alarming a disk failure.
Background
As informatization deepens, the demand for data storage grows and distributed storage systems continue to expand. Disks are accordingly read and written ever more frequently, so disk faults inevitably occur during use, and disk fault alarms are particularly important for prompting technicians to repair faults in time.
Current disk fault alarm schemes lag considerably: faults tend to occur suddenly, and repairing sudden faults places high demands on technicians. Once a disk has failed, it inevitably has a negative effect on the storage system and impairs its normal, stable operation.
In summary, how to implement fast processing of disk failures and ensure the operation stability of the storage system is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a disk fault alarming method, a device, equipment and a readable storage medium, which can realize the rapid processing of disk faults and ensure the running stability of a storage system.
In order to solve the technical problems, the invention provides the following technical scheme:
a disk failure warning method comprising:
collecting disk characteristic data of each node in the cluster to obtain a feature library;
monitoring the disk health state of the nodes;
if the monitoring determines that a first disk has failed, determining the feature data corresponding to the first disk from the feature library as target feature data;
performing feature matching in the feature library according to the target feature data to obtain a feature-matched disk as a fault critical disk;
and issuing a fault alarm for the first disk and a fault early warning for the fault critical disk.
Optionally, collecting the disk characteristic data of each node in the cluster to obtain the feature library includes:
periodically collecting the disk characteristic data of all nodes in the cluster;
and compiling, from the disk characteristic data, the feature data of each disk at each collection time to obtain the feature library.
Optionally, periodically collecting the disk characteristic data of all nodes in the cluster includes:
periodically collecting disk damage state data of all nodes in the cluster as the disk characteristic data, wherein the disk damage state data includes the number of damaged sectors.
Optionally, if the monitoring determines that the first disk has failed, determining the feature data corresponding to the first disk from the feature library as the target feature data includes:
if the monitoring determines that the first disk has failed, determining the number of damaged sectors of the first disk from the feature library as a damaged sector threshold;
determining a critical-fault damaged sector number range as the target feature data according to a preset checking range and the damaged sector threshold;
correspondingly, performing feature matching in the feature library according to the target feature data to obtain the feature-matched disk as the fault critical disk includes:
screening out from the feature library the disks whose number of damaged sectors falls within the critical-fault damaged sector number range, and taking them as the fault critical disks.
Optionally, if the monitoring determines that the first disk has failed, determining the feature data corresponding to the first disk from the feature library as the target feature data includes:
if the monitoring determines that the first disk has failed, obtaining from the feature library the numbers of damaged sectors of the first disk within a preset time range;
generating, from those numbers of damaged sectors, the growth of the number of damaged sectors of the first disk as the target feature data;
correspondingly, performing feature matching in the feature library according to the target feature data to obtain the feature-matched disk as the fault critical disk includes:
computing the growth of the number of damaged sectors of each disk in the feature library;
matching growth change rules between the growth of each disk and the growth of the number of damaged sectors of the first disk to obtain the disks whose growth rule matches;
and taking the disks whose growth rule matches as the fault critical disks.
Optionally, matching growth change rules between the growth of each disk and the growth of the number of damaged sectors of the first disk includes:
matching growth rates between the growth of each disk and the growth of the number of damaged sectors of the first disk.
Optionally, performing feature matching in the feature library according to the target feature data to obtain the feature-matched disk as the fault critical disk includes:
obtaining the disks in the feature library that have the same disk model as the first disk as candidate disks;
and performing feature matching, according to the target feature data, on the feature data of the candidate disks in the feature library to obtain the feature-matched disk as the fault critical disk.
A disk failure warning apparatus comprising:
a data collection unit, configured to collect disk characteristic data of each node in the cluster to obtain a feature library;
a state monitoring unit, configured to monitor the disk health state of the nodes;
a data extraction unit, configured to determine, if the monitoring determines that a first disk has failed, the feature data corresponding to the first disk from the feature library as target feature data;
a feature matching unit, configured to perform feature matching in the feature library according to the target feature data to obtain a feature-matched disk as a fault critical disk;
and a fault alarm unit, configured to issue a fault alarm for the first disk and a fault early warning for the fault critical disk.
Disk failure warning equipment comprising:
a memory for storing a computer program;
and a processor configured to implement the steps of the above disk failure warning method when executing the computer program.
A readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the above-described disk failure warning method.
The method provided by the embodiment of the invention collects the characteristic data and health state of the disks belonging to the same storage system cluster. Disks in the same storage system are used in a balanced way and share the same running environment, so when one disk in the storage system fails, the other disks are, with high probability, at a fault critical point. Fault prediction for the other disks is therefore based on the disk characteristics of the failed disk: if a currently running disk matches those characteristics, it is judged to have a higher risk of failure. This achieves prediction and early warning of disk faults, so that technicians can investigate and maintain the fault critical disks in time, which avoids the impact on system stability that would follow their failure and reduces the operation and maintenance burden.
Correspondingly, the embodiment of the invention also provides a disk fault alarm device, equipment and a readable storage medium corresponding to the disk fault alarm method. They have the same technical effects, which are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a flowchart of a disk failure warning method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of feature collection and alarm in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a disk failure alarm device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of disk failure alarm equipment according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide a disk fault alarm method which can realize the rapid processing of disk faults and ensure the running stability of a storage system.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, which is a flowchart of a disk failure warning method according to an embodiment of the present invention, the method includes the following steps:
S101, collecting disk characteristic data of each node in the cluster to obtain a feature library;
Disk characteristic data is collected from each node in the cluster. A node may contain more than one disk, in which case the characteristic data of every disk under the node must be collected. For comprehensive coverage, the characteristic data of all disks in all nodes of the cluster can be collected; alternatively, only designated nodes (or designated disks) can be covered. The nodes and disks to collect from can be chosen according to which disks are actually to be monitored for faults.
This embodiment does not limit the types of disk characteristic data collected. They may include, for example, the disk state (such as sector damage or sector read-write rate), the disk model, and the disk space used. The characteristic data can be configured to suit the chosen fault prediction mode, which is not described further here.
The collected disk characteristic data can be stored grouped by node, or the characteristic data of each disk under each node can be listed directly. This embodiment does not limit how information is organized in the feature library; it can be set according to actual viewing needs.
In addition, the disk characteristic data of the nodes can be collected periodically or only once. If the data changes little, a single collection may suffice; if it changes noticeably, it can be collected periodically or in real time so that changes are perceived promptly and disk faults can be evaluated effectively.
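As a minimal sketch of S101, the feature library could be held as a mapping from a (node, disk) pair to that disk's collected features. The names `DiskFeatures` and `build_feature_library`, the field names, and the sample readings are all hypothetical; the readings are passed in directly rather than read from real hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskFeatures:
    """Features of one disk; the fields are illustrative, not prescribed by the method."""
    model: str
    damaged_sectors: int


def build_feature_library(readings):
    """Build a feature library keyed by (node, disk) from raw readings.

    `readings` is an iterable of (node, disk, model, damaged_sectors) tuples,
    standing in for whatever collection mechanism the cluster actually uses.
    """
    library = {}
    for node, disk, model, damaged in readings:
        library[(node, disk)] = DiskFeatures(model=model, damaged_sectors=damaged)
    return library


# Hypothetical sample data for three disks on two nodes.
readings = [
    ("node1", "disk1", "MODEL-A", 5),
    ("node1", "disk2", "MODEL-A", 6),
    ("node2", "disk1", "MODEL-B", 0),
]
feature_library = build_feature_library(readings)
```

Keying by (node, disk) keeps both storage options mentioned above available: entries can be grouped by node when needed, or listed flat.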
S102, monitoring the disk health state of the nodes;
The disk health state of the nodes is monitored to determine the state of each disk, for example that disk 1 in node 1 is running normally while disk 2 has failed. Monitoring lets disk faults be found quickly and handled in time.
The disk health state can be monitored in real time or checked periodically. So that faults are perceived promptly, the monitoring interval should not be too long.
This embodiment does not limit how the disk health state of the nodes is monitored; implementations in the related art may be used and are not described here.
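Since the monitoring mechanism is left open, the following is only one possible sketch: two successive health snapshots are compared to find disks that have just failed. The function name `newly_failed` and the state strings "ok" and "failed" are assumptions made for illustration.

```python
def newly_failed(previous, current):
    """Return disks that are failed in `current` but were not failed in `previous`.

    Both arguments map a (node, disk) key to a health string; "failed" marks a fault.
    """
    return sorted(
        key for key, state in current.items()
        if state == "failed" and previous.get(key) != "failed"
    )


# Mirrors the example above: disk 1 in node 1 runs normally, disk 2 fails.
previous = {("node1", "disk1"): "ok", ("node1", "disk2"): "ok"}
current = {("node1", "disk1"): "ok", ("node1", "disk2"): "failed"}
failed_now = newly_failed(previous, current)
```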
S103, if the monitoring determines that a first disk has failed, determining the feature data corresponding to the first disk from the feature library as target feature data;
Analysis shows that a storage system generally uses disks of the same model and batch, and that disk utilization in a distributed system is kept balanced. When one disk in the storage system fails, the other disks are therefore, with high probability, at a fault critical point. The disk fault prediction in this method builds on this observation: fault prediction for the other disks is based on the disk characteristics of the failed disk, and if a currently running disk matches those characteristics, it is judged to have a higher risk of failure.
To this end, the feature data of the first disk, which has just failed, is first looked up in the feature library that holds the disk characteristic data of all nodes in the cluster. It is taken as the target feature data, against which the feature data of the other disks is then matched.
S104, performing feature matching in the feature library according to the target feature data to obtain a feature-matched disk as a fault critical disk;
The feature library stores the disk characteristic data of each disk of each node, and this data is matched against the target feature data. All disks in the feature library can be compared, or only designated disks. To improve prediction accuracy and reduce the matching workload, only the disks in the feature library with the same disk model as the first disk may be taken as candidate disks; feature matching is then performed, according to the target feature data, on the feature data of those candidate disks.
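The same-model pre-filter described above can be sketched as follows. The dictionary layout, the key names `"model"` and `"damaged_sectors"`, and the function name `candidates_same_model` are illustrative assumptions.

```python
def candidates_same_model(library, failed_key):
    """Return keys of disks sharing the failed disk's model, excluding the failed disk."""
    failed_model = library[failed_key]["model"]
    return [
        key for key, features in library.items()
        if key != failed_key and features["model"] == failed_model
    ]


# Hypothetical feature library; only the model is used by the filter.
library = {
    ("node1", "disk1"): {"model": "MODEL-A", "damaged_sectors": 6},
    ("node1", "disk2"): {"model": "MODEL-A", "damaged_sectors": 5},
    ("node2", "disk1"): {"model": "MODEL-B", "damaged_sectors": 6},
}
candidates = candidates_same_model(library, ("node1", "disk1"))
```

Only the disks returned here would then go through feature matching, which is what reduces the matching workload.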
In addition, this embodiment does not limit the comparison method or the evaluation quantity used in feature matching. For example, the number of damaged sectors may serve as the evaluation quantity, and a disk whose number of damaged sectors equals that of the first disk is judged to match. Only this case is described here as an example; other ways of comparing features can follow the description of this embodiment and are not repeated.
If feature matching in the feature library yields a feature-matched disk, that disk has the same disk characteristics as the failed disk, so its risk of failure is high.
It should be noted that this embodiment does not limit what is done when feature matching in the feature library yields no matching disk. The fault alarm may be issued for the first disk only, or feature matching may be performed again; this can be set according to the actual fault prediction requirements and is not described here.
S105, issuing a fault alarm for the first disk and a fault early warning for the fault critical disk.
The first disk has been determined to have failed, so an actual fault alarm is issued for it. The fault critical disk is a disk that this embodiment predicts is likely to fail. As shown in FIG. 2, a fault prediction warning can be issued for it to indicate that it is running poorly and may be at a fault critical point. Technicians can then investigate and maintain the fault critical disk in time, which avoids the impact on system stability that would follow its failure and reduces the operation and maintenance burden.
This embodiment does not limit when the fault alarm for the first disk and the fault early warning for the fault critical disk are triggered. Both may be issued together once the fault critical disk has been determined; alternatively, the fault alarm may be triggered as soon as monitoring determines that the first disk has failed, and the fault early warning issued later, once the fault critical disk has been determined. These two approaches are only examples; other triggering modes may also be adopted by reference to the above description and are not repeated here.
Based on the above description, the technical scheme provided by the embodiment of the invention collects the characteristic data and health state of the disks belonging to the same storage system cluster. Disks in the same storage system are used in a balanced way and share the same running environment, so when one disk in the storage system fails, the other disks are, with high probability, at a fault critical point. Fault prediction for the other disks is therefore based on the disk characteristics of the failed disk: if a currently running disk matches those characteristics, it is judged to have a higher risk of failure. This achieves prediction and early warning of disk faults, so that technicians can investigate and maintain the fault critical disks in time, which avoids the impact on system stability that would follow their failure and reduces the operation and maintenance burden.
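Steps S103 to S105 can be sketched end to end as below. This is only an illustration under stated assumptions: the feature library is reduced to a damaged-sector count per disk, matching uses the exact-equality comparison given as an example in the description, and the names `handle_failure`, `"FAULT"` and `"EARLY_WARNING"` are hypothetical.

```python
def handle_failure(library, failed_key):
    """Run S103 to S105 for one failed disk and return the alerts to raise.

    `library` maps a disk key to its damaged-sector count. Matching here is
    exact equality of that count, the simple example used in the description.
    """
    target = library[failed_key]                      # S103: target feature data
    critical = [                                      # S104: feature matching
        key for key, damaged in library.items()
        if key != failed_key and damaged == target
    ]
    alerts = [("FAULT", failed_key)]                  # S105: alarm for the failed disk
    alerts += [("EARLY_WARNING", key) for key in critical]
    return alerts


# Hypothetical counts: disk2 matches the failed disk1, disk3 does not.
library = {"disk1": 6, "disk2": 6, "disk3": 2}
alerts = handle_failure(library, "disk1")
```

This sketch returns both kinds of alert together, which corresponds to the first triggering mode described above; the second mode would emit the fault alarm before matching starts.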
It should be noted that the embodiments of the present invention further provide corresponding improvements on the above embodiment. Steps in the preferred/improved embodiments that are the same as or correspond to steps in the above embodiment may be cross-referenced, as may the corresponding advantages, so they are not described again in detail.
The above embodiment does not limit how the disk characteristic data of each node in the cluster is collected to obtain the feature library. This embodiment describes one way of collecting the data and of performing feature matching against the feature library.
Collecting the disk characteristic data of each node in the cluster to obtain the feature library may be implemented as follows:
(1) Periodically collecting the disk characteristic data of all nodes in the cluster;
(2) Compiling, from the disk characteristic data, the feature data of each disk at each collection time to obtain the feature library.
This approach combines periodic collection with per-time statistics. Periodic collection keeps the disk characteristic data up to date and improves fault prediction accuracy; organizing the feature data by collection time brings out how the features change over time and improves analysis accuracy.
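One way to picture a feature library organized by collection time is a per-disk list of (time, value) samples. The sketch below assumes integer timestamps and a damaged-sector count as the only feature; `record_samples` and the sample values are hypothetical.

```python
from collections import defaultdict


def record_samples(history, timestamp, samples):
    """Append one collection round to the per-disk history.

    `history` maps a disk key to a list of (timestamp, damaged_sectors) pairs;
    `samples` maps a disk key to the count read at `timestamp`.
    """
    for disk, damaged in samples.items():
        history[disk].append((timestamp, damaged))


# Three hypothetical collection rounds for two disks.
history = defaultdict(list)
record_samples(history, 1, {"disk1": 1, "disk2": 0})
record_samples(history, 2, {"disk1": 1, "disk2": 0})
record_samples(history, 3, {"disk1": 5, "disk2": 1})
```

In a real deployment `record_samples` would be called by whatever scheduler drives the periodic collection; that part is outside this sketch.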
The above data collection approach does not limit the types of disk characteristic data collected. Because a disk's failure is closely related to its damage state (the more severe the damage, the more likely the failure), faults can be evaluated and predicted from disk damage state data. Within the damage state, the number of damaged sectors is an important evaluation value: the more damaged sectors, the greater the probability of failure. Disk damage state data (including the number of damaged sectors) of all nodes in the cluster can therefore be collected periodically as the disk characteristic data, which improves prediction accuracy while reducing the difficulty of data analysis.
Based on a feature library that records the number of damaged sectors, one way of selecting the target feature data is as follows:
(1) If the monitoring determines that the first disk has failed, determining the number of damaged sectors of the first disk from the feature library as a damaged sector threshold;
For example, if the failed first disk has 6 damaged sectors, 6 can be used as the damaged sector threshold: in a similar operating environment, a disk with 6 damaged sectors has a high probability of failing.
(2) Determining the critical-fault damaged sector number range as the target feature data according to a preset checking range and the damaged sector threshold;
If the preset checking range is ±1 and the damaged sector threshold is 6, the critical-fault damaged sector number range is 6 ± 1, namely [5, 7], and [5, 7] is taken as the target feature data.
Correspondingly, given target feature data selected in this way, one way of performing feature matching in the feature library to obtain the feature-matched disk as the fault critical disk is as follows:
(3) Screening out from the feature library the disks whose number of damaged sectors falls within the critical-fault damaged sector number range, and taking them as the fault critical disks.
For example, if the currently running disk 1 has 5 damaged sectors in the feature library, then 5 ∈ [5, 7], so the number of damaged sectors of disk 1 falls within the range and disk 1 is judged to be a fault critical disk. If the currently running disk 2 has 4 damaged sectors, that number does not fall within the range, and disk 2 is not a fault critical disk.
In this implementation the evaluation quantity is easy to obtain, the feature matching is simple, and the evaluation accuracy is high, so accurate disk fault prediction can be achieved.
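The threshold-range matching in steps (1) to (3) can be sketched with the numbers from the example. The function names are hypothetical, and flooring the lower bound at 0 is an added assumption, since a sector count cannot be negative.

```python
def critical_range(threshold, margin):
    """Return the inclusive (low, high) range threshold ± margin, floored at 0."""
    return max(0, threshold - margin), threshold + margin


def match_by_range(library, failed_key, margin):
    """Return running disks whose damaged-sector count lies in the critical range."""
    low, high = critical_range(library[failed_key], margin)
    return [
        key for key, damaged in library.items()
        if key != failed_key and low <= damaged <= high
    ]


# Values taken from the example: the failed disk has 6 damaged sectors,
# disk 1 has 5 and disk 2 has 4, with a checking range of ±1.
library = {"failed": 6, "disk1": 5, "disk2": 4}
critical = match_by_range(library, "failed", margin=1)
```

Widening `margin` flags more disks at the cost of more false warnings; the description leaves the choice of checking range to the deployment.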
Based on the feature library of the damaged sector number statistics, another implementation mode of screening target feature data is as follows:
(1) If the first disk faults are determined through monitoring, the number of damaged sectors corresponding to the first disk in a preset time range is obtained from a feature library;
the preset time range is a time range for carrying out data statistics on the first disk, and can be a set time period, such as a certain period of time before a fault occurs; and the time range setting can be performed for all time nodes obtained by statistics in the feature library according to the accuracy of actual feature matching, and the detailed description is omitted.
(2) According to the number of damaged sectors, counting and generating the increase condition of the number of damaged sectors corresponding to the first disk as target characteristic data;
for example, the number of damaged sectors of the first disk is 1 at the first time, the number of damaged sectors is 1 at the second time, the number of damaged sectors is 1 at the third time, the number of damaged sectors is 5 at the fourth time, the number of damaged sectors is 6 at the fifth time, and the disk failure is determined at the fifth time, so that the above data can be used as target characteristic data.
Correspondingly, with the target feature data selected in this way, one implementation of performing feature matching in the feature library according to the target feature data, to obtain a feature-matched disk as a fault critical disk, is as follows:
(3) For each disk in the feature library, compute the growth condition of its number of damaged sectors;
the growth condition of the other disks in the feature library is computed in the same way as for the failed first disk, although the statistical time ranges of the two may differ.
(4) Perform growth change rule matching between the growth condition of each disk and the damaged-sector growth condition of the first disk, to obtain the disks whose growth rule matches;
a growth change rule, such as a sudden increase of more than 200% in the number of damaged sectors at some time node, or a 10% increase in the number of damaged sectors at regular intervals, is generated adaptively from the data; these are only two examples.
(5) Take the disks whose growth rule matches as fault critical disks.
If the feature library contains a disk whose growth rule matches that of the first disk, the operating trajectory of that disk is highly similar to that of the failed first disk, and the disk is likely to reach the same outcome as the first disk (namely a fault), so a disk whose growth rule matches can be taken as a fault critical disk.
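Steps (3) to (5) can be sketched as follows, using the "sudden increase of more than 200%" rule mentioned above as the only growth change rule. The threshold constant, the function names, the disk identifiers, and the dictionary layout of the feature library are all hypothetical; the text says rules are generated adaptively from the data and does not prescribe this form.

```python
from typing import Dict, List

# Assumed rule: a relative jump of more than 200% between consecutive samples.
SURGE_THRESHOLD = 2.0


def has_surge(counts: List[int], threshold: float = SURGE_THRESHOLD) -> bool:
    """True if any consecutive pair of samples grows by more than `threshold`."""
    for earlier, later in zip(counts, counts[1:]):
        if earlier > 0 and (later - earlier) / earlier > threshold:
            return True
    return False


def find_critical_disks(feature_library: Dict[str, List[int]],
                        failed_disk: str) -> List[str]:
    """Return the other disks that show the same surge rule as the failed disk."""
    if not has_surge(feature_library[failed_disk]):
        return []  # the failed disk shows no surge, so this rule does not apply
    return sorted(
        disk for disk, counts in feature_library.items()
        if disk != failed_disk and has_surge(counts)
    )


library = {
    "disk-a": [1, 1, 1, 5, 6],  # failed first disk from the example in the text
    "disk-b": [2, 2, 9, 9],     # 2 -> 9 is a 350% jump
    "disk-c": [4, 4, 5, 5],     # slow growth only
}
print(find_critical_disks(library, "disk-a"))  # ['disk-b']
```

Note that the series may have different lengths, which reflects the remark that the statistical time ranges of the failed disk and the other disks may differ.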
Many indicators can be evaluated in a growth rule, such as the average growth or the maximum value, and they can be chosen according to the actual comparison requirements. For example, growth rate matching can be performed between the growth condition of each disk and the damaged-sector growth condition of the first disk. The growth rate reflects the overall change over a longer time range and excludes interference from differences between operating nodes, enabling high-precision analysis.
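Growth rate matching might be sketched as a comparison of average growth per sampling interval. The tolerance value and the function names below are assumptions for illustration only; the text does not specify how close two rates must be to count as a match.

```python
from typing import Sequence


def average_growth_rate(counts: Sequence[int]) -> float:
    """Average increase in damaged sectors per sampling interval over the window."""
    if len(counts) < 2:
        return 0.0
    return (counts[-1] - counts[0]) / (len(counts) - 1)


def rates_match(reference: Sequence[int], candidate: Sequence[int],
                tolerance: float = 0.25) -> bool:
    """True if the candidate's average rate is within `tolerance` (relative) of the reference's."""
    ref_rate = average_growth_rate(reference)
    cand_rate = average_growth_rate(candidate)
    if ref_rate == 0:
        return cand_rate == 0
    return abs(cand_rate - ref_rate) / abs(ref_rate) <= tolerance


failed = [1, 1, 1, 5, 6]                      # average rate 1.25 per interval
print(rates_match(failed, [0, 2, 3, 4, 5]))   # True: rate 1.25
print(rates_match(failed, [3, 3, 3, 3, 4]))   # False: rate 0.25
```

Because the rate is averaged over the whole window, a single noisy sample has less influence than it would in a point-by-point comparison.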
This implementation analyzes the underlying operating trajectory through the growth rule of damaged sectors and can achieve more accurate feature matching, but its data analysis is more involved than in the preceding implementation.
It should be noted that this embodiment describes the generation and matching of the feature library using only the above implementations as examples; other implementations based on this application may refer to the description of this embodiment and are not repeated here.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a disk failure warning device; the disk failure warning device described below and the disk failure warning method described above may be cross-referenced.
Referring to fig. 3, the apparatus includes the following units:
the data collection unit 110 is mainly used for collecting disk feature data of each node in the cluster to obtain a feature library;
the state monitoring unit 120 is mainly used for monitoring the disk health state of the nodes;
the data extraction unit 130 is mainly used for, if monitoring determines that the first disk has failed, determining the feature data corresponding to the first disk from the feature library as the target feature data;
the feature matching unit 140 is mainly used for performing feature matching in the feature library according to the target feature data to obtain a feature-matched disk as a fault critical disk;
the fault alarm unit 150 is mainly used for raising a fault alarm for the first disk and a fault early warning for the fault critical disk.
It should be noted that this embodiment does not limit how functions are divided among the units. To aid understanding, another unit arrangement of the device is described here, in which the device mainly includes three units: a disk feature acquisition unit, a fault detection unit, and a feature matching unit.
The disk feature acquisition unit runs as a daemon on all nodes in the cluster, periodically collects disk features (such as the number of damaged sectors) on each node, and records the disk number, time, and feature data to generate the feature library;
the fault detection unit runs as a daemon on all nodes in the cluster and periodically checks the health state of the disks on each node; when a disk fails or becomes unavailable, it triggers a disk fault alarm;
the feature matching unit, when a disk fails, obtains the feature information of the failed disk from the disk feature library and matches it against all data in the feature library; if another disk's features match, that disk is considered to be in a fault critical state and a disk fault early warning is triggered.
Based on this unit division, one way the units cooperate is as follows: the feature acquisition unit collects information on all disks in the cluster and generates the feature database. When the fault detection unit detects a disk fault, it triggers the feature matching unit, which obtains the feature information of the failed disk and matches it against the historical records in the feature database. If matching feature data exists, a predicted disk fault alarm is triggered.
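The cooperation of the three units can be sketched as below. Everything here is a hypothetical illustration: the `FeatureLibrary` class stands in for the feature database, `on_disk_failure` stands in for the hand-off from the fault detection unit to the feature matching unit, and `near_failure_count` is one assumed matcher (a candidate whose latest damaged-sector count is close to the failed disk's). Daemon scheduling and real disk probing are omitted.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class FeatureLibrary:
    """In-memory stand-in for the feature database: disk id -> [(time, damaged sectors)]."""

    def __init__(self) -> None:
        self.records: Dict[str, List[Tuple[int, int]]] = defaultdict(list)

    def record(self, disk_id: str, timestamp: int, damaged_sectors: int) -> None:
        """What the feature acquisition unit would call on each timed collection."""
        self.records[disk_id].append((timestamp, damaged_sectors))

    def latest(self, disk_id: str) -> int:
        """Most recently recorded damaged-sector count for a disk."""
        return self.records[disk_id][-1][1]


def near_failure_count(library: FeatureLibrary, failed_disk: str,
                       candidate: str, margin: int = 2) -> bool:
    """Assumed matcher: candidate's latest count is within `margin` of the failed disk's."""
    return abs(library.latest(candidate) - library.latest(failed_disk)) <= margin


def on_disk_failure(library: FeatureLibrary, failed_disk: str,
                    matcher: Callable[[FeatureLibrary, str, str], bool]) -> Dict[str, List[str]]:
    """Triggered when a fault is detected; returns which disks get an alarm or a warning."""
    critical = [disk for disk in library.records
                if disk != failed_disk and matcher(library, failed_disk, disk)]
    return {"fault_alarm": [failed_disk], "fault_warning": sorted(critical)}


lib = FeatureLibrary()
lib.record("sda", 1, 1)
lib.record("sda", 2, 6)   # sda is the disk that fails
lib.record("sdb", 1, 0)
lib.record("sdb", 2, 5)   # close to the failed disk's count
lib.record("sdc", 2, 0)   # healthy
print(on_disk_failure(lib, "sda", near_failure_count))
# {'fault_alarm': ['sda'], 'fault_warning': ['sdb']}
```

Passing the matcher as a parameter keeps the hand-off independent of the matching strategy, so a growth-rule matcher could be substituted without changing the other units.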
This embodiment describes only the above two unit division manners of the device as examples; other unit division manners may refer to the description above and are not repeated here.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a disk failure warning device; the disk failure warning device described below and the disk failure warning method described above may be cross-referenced.
The disk failure warning apparatus includes:
a memory for storing a computer program;
a processor for implementing the steps of the disk failure warning method of the above method embodiment when executing the computer program.
Specifically, referring to fig. 4, which is a schematic diagram of a specific structure of the disk failure warning device provided in this embodiment, the disk failure warning device may vary considerably with configuration or performance, and may include one or more central processing units (CPU) 322 and a memory 332, where the memory 332 stores one or more computer applications 342 or data 344. The memory 332 may be transient storage or persistent storage. The program stored in the memory 332 may include one or more modules (not shown), each of which may include a series of instruction operations for the data processing device. Further, the central processing unit 322 may be configured to communicate with the memory 332 and to execute, on the disk failure warning device 301, the series of instruction operations in the memory 332.
The disk failure warning device 301 can also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input output interfaces 358, and/or one or more operating systems 341.
The steps in the disk failure warning method described above may be implemented by the structure of the disk failure warning apparatus.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a readable storage medium; the readable storage medium described below and the disk failure warning method described above may be cross-referenced.
A readable storage medium has a computer program stored on it which, when executed by a processor, implements the steps of the disk failure warning method of the above method embodiment.
The readable storage medium may be a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Those skilled in the art will further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both, and that the units and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the solution. Those skilled in the art may implement the described functionality in different ways for each particular application, but such implementation is not intended to be limiting.

Claims (9)

1. A disk failure warning method, comprising:
collecting disk feature data of each node in the cluster to obtain a feature library;
monitoring the disk health state of the nodes;
if the first disk is determined to be faulty through the monitoring, determining feature data corresponding to the first disk from the feature library as target feature data;
performing feature matching in the feature library according to the target feature data to obtain a feature-matched disk as a fault critical disk;
performing fault alarm on the first disk and performing fault early warning on the fault critical disk;
wherein the determining, if the first disk is determined to be faulty through the monitoring, of feature data corresponding to the first disk from the feature library as target feature data comprises:
if the first disk is determined to be faulty through the monitoring, obtaining, from the feature library, the number of damaged sectors corresponding to the first disk within a preset time range;
generating, by statistics on the number of damaged sectors, the sector damage number growth condition corresponding to the first disk as the target feature data;
correspondingly, the performing of feature matching in the feature library according to the target feature data to obtain a feature-matched disk as a fault critical disk comprises:
separately computing the number growth condition corresponding to the number of damaged sectors of each disk in the feature library;
performing growth change rule matching according to the number growth condition corresponding to each disk and the sector damage number growth condition corresponding to the first disk to obtain a disk whose growth rule matches; and taking the disk whose growth rule matches as the fault critical disk.
2. The disk fault warning method according to claim 1, wherein the collecting disk feature data of each node in the cluster to obtain the feature library includes:
disk characteristic data of all nodes in the cluster are collected at fixed time;
and counting the characteristic data corresponding to each disk in each acquisition time according to the disk characteristic data to obtain a characteristic library.
3. The disk fault warning method according to claim 2, wherein the periodically collecting disk characteristic data of all nodes in the cluster includes:
disk damage state data of all nodes in the cluster are collected at fixed time and used as the disk characteristic data; wherein the disk damage status data includes a number of damaged sectors.
4. The disk failure warning method according to claim 3, wherein the determining, if the first disk is determined to be faulty through the monitoring, of feature data corresponding to the first disk from the feature library as target feature data comprises:
if the first disk is determined to be faulty through the monitoring, determining the number of damaged sectors corresponding to the first disk from the feature library, and taking that number as a damaged sector threshold;
determining a critical fault sector damage number range as the target feature data according to a preset investigation range and the damaged sector threshold;
correspondingly, the performing of feature matching in the feature library according to the target feature data to obtain a feature-matched disk as a fault critical disk comprises:
screening out, from the feature library, the disks whose number of damaged sectors falls within the critical fault sector damage number range, and taking those disks as the fault critical disks.
5. The disk failure warning method according to claim 1, wherein the performing the growth change rule matching according to the number growth condition corresponding to each disk and the sector damage number growth condition corresponding to the first disk includes:
and performing growth rate matching according to the number growth condition corresponding to each disk and the number growth condition of the sector damage corresponding to the first disk.
6. The disk fault warning method according to claim 1, wherein performing feature matching according to the target feature data in the feature library to obtain a feature matched disk as a fault critical disk comprises:
obtaining a disk with the same disk model as the first disk in the feature library as a disk to be selected;
and performing feature matching on the feature data corresponding to the disk to be selected in the feature library according to the target feature data to obtain a disk with the matched feature, and taking the disk as a fault critical disk.
7. A disk failure warning apparatus, comprising:
a data collection unit for collecting disk feature data of each node in the cluster to obtain a feature library;
a state monitoring unit for monitoring the disk health state of the nodes;
a data extraction unit for determining, if the first disk is determined to be faulty through the monitoring, the feature data corresponding to the first disk from the feature library as target feature data;
a feature matching unit for performing feature matching in the feature library according to the target feature data to obtain a feature-matched disk as a fault critical disk;
a fault alarm unit for performing fault alarm on the first disk and performing fault early warning on the fault critical disk;
wherein the data extraction unit is specifically configured to:
if the first disk is determined to be faulty through the monitoring, obtain, from the feature library, the number of damaged sectors corresponding to the first disk within a preset time range;
generate, by statistics on the number of damaged sectors, the sector damage number growth condition corresponding to the first disk as the target feature data;
and the feature matching unit is correspondingly specifically configured to:
separately compute the number growth condition corresponding to the number of damaged sectors of each disk in the feature library;
perform growth change rule matching according to the number growth condition corresponding to each disk and the sector damage number growth condition corresponding to the first disk to obtain a disk whose growth rule matches; and take the disk whose growth rule matches as the fault critical disk.
8. A disk failure warning apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the disk failure warning method according to any one of claims 1 to 6 when executing the computer program.
9. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the disk failure warning method according to any one of claims 1 to 6.
CN202011601555.3A 2020-12-29 2020-12-29 Disk fault alarm method, device, equipment and readable storage medium Active CN112732517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011601555.3A CN112732517B (en) 2020-12-29 2020-12-29 Disk fault alarm method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112732517A CN112732517A (en) 2021-04-30
CN112732517B true CN112732517B (en) 2023-12-22

Family

ID=75609987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011601555.3A Active CN112732517B (en) 2020-12-29 2020-12-29 Disk fault alarm method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112732517B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722179B (en) * 2021-08-13 2024-02-13 浪潮电子信息产业股份有限公司 Method, system and device for monitoring health state of magnetic disk

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1551202A (en) * 2003-05-06 2004-12-01 �Ҵ���˾ Self_repair method and storage system
US7653840B1 (en) * 2007-04-27 2010-01-26 Net App, Inc. Evaluating and repairing errors during servicing of storage devices
CN105068901A (en) * 2015-07-27 2015-11-18 浪潮电子信息产业股份有限公司 Method for detecting magnetic disc
CN109308238A (en) * 2018-12-03 2019-02-05 郑州云海信息技术有限公司 A kind of method, device and equipment that storage system disk array low-quality disk is adjusted
CN110752972A (en) * 2019-10-29 2020-02-04 北京浪潮数据技术有限公司 Network card state monitoring method, device, equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536548B (en) * 2018-04-10 2020-12-29 网宿科技股份有限公司 Method and device for processing bad track of disk and computer storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jian Zhao; Yongzhan He; Hongmei Liu; Jiajun Zhang; Bin Liu; Jun Zhang; Wenqing Lv. Disk Failure Early Warning Based on the Characteristics of Customized SMART. IEEE. 2020, full text. *
Disk fault detection mechanism in distributed storage systems; 刘榴; 李小勇; 信息技术 (No. 05); full text *
Storage management strategy based on a flash memory platform in embedded systems; 李建勋; 樊晓光; 禚真福; 电子技术应用 (No. 05); full text *

Also Published As

Publication number Publication date
CN112732517A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
CN110413227B (en) Method and system for predicting remaining service life of hard disk device on line
JP7158586B2 (en) Hard disk failure prediction method, apparatus and storage medium
US10229129B2 (en) Method and apparatus for managing time series database
US8255522B2 (en) Event detection from attributes read by entities
CN111600746B (en) Network fault positioning method, device and equipment
CN111459700A (en) Method and apparatus for diagnosing device failure, diagnostic device, and storage medium
Smith et al. An anomaly detection framework for autonomic management of compute cloud systems
CN110471816B (en) Data management method and device for solid state disk
KR20170084445A (en) Method and apparatus for detecting abnormality using time-series data
CN113220534A (en) Cluster multi-dimensional anomaly monitoring method, device, equipment and storage medium
CN109885456A (en) A kind of polymorphic type event of failure prediction technique and device based on system log cluster
CN112148561B (en) Method and device for predicting running state of business system and server
CN114595210A (en) Multi-dimensional data anomaly detection method and device and electronic equipment
CN112804079A (en) Cloud computing platform alarm analysis method, device, equipment and storage medium
CN112732517B (en) Disk fault alarm method, device, equipment and readable storage medium
CN111314158A (en) Big data platform monitoring method, device, equipment and medium
CN115080356A (en) Abnormity warning method and device
CN114860487A (en) Memory fault identification method and memory fault isolation method
CN114443441A (en) Storage system management method, device, equipment and readable storage medium
CN115690681A (en) Processing method of abnormity judgment basis, abnormity judgment method and device
CN111614504A (en) Power grid regulation and control data center service characteristic fault positioning method and system based on time sequence and fault tree analysis
CN116149926A (en) Abnormality monitoring method, device, equipment and storage medium for business index
CN112737120B (en) Regional power grid control report generation method and device and computer equipment
JP6666489B1 (en) Failure sign detection system
CN112799911A (en) Node health state detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant