WO2023236753A1 - Hard disk failure prediction method, device, storage medium, and electronic device - Google Patents

Hard disk failure prediction method, device, storage medium, and electronic device

Info

Publication number
WO2023236753A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
hard disk
risk
fault prediction
failure
Prior art date
Application number
PCT/CN2023/095118
Other languages
English (en)
French (fr)
Inventor
易哲
郑紫阳
王斌
陈建辉
王喜
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023236753A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2257 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2273 Test methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present disclosure relate to the field of communications, and specifically, to a hard disk failure prediction method, device, storage medium, and electronic device.
  • In the related art, when the hard disk failure prediction made by artificial intelligence is not accurate enough, the NTF is high, and it is often impossible to explain to users why a risky disk is in a risky state, which causes users to question and distrust the hard disk failure prediction system. No solution to this problem has yet been proposed.
  • Embodiments of the present disclosure provide a hard disk failure prediction method, device, storage medium and electronic device, to at least solve the problem in the related art that, when the artificial intelligence prediction of hard disk failure is not accurate enough, the NTF is high and it is often impossible to explain to users why a risky disk is in a risky state, which causes users to question and distrust the hard disk failure prediction system.
  • a hard disk failure prediction method is provided, which includes:
  • collecting hard disk data to be processed, where the hard disk data includes Self-Monitoring, Analysis and Reporting Technology (SMART) data and performance data;
  • splicing the SMART data and the performance data to obtain target hard disk data;
  • inputting the target hard disk data into a pre-trained fault prediction model to obtain a first fault prediction result output by the fault prediction model;
  • when the first fault prediction result is that there is a risk of failure, making a secondary determination, based on the SMART data and performance data of the hard disk data to be processed, of whether the hard disk data to be processed presents a risk of failure, and obtaining a second fault prediction result;
  • determining the hard disk failure risk level according to the second fault prediction result.
  • a hard disk failure prediction device is also provided, and the device includes:
  • the first collection module is configured to collect hard disk data to be processed, where the hard disk data includes SMART data and performance data;
  • the first splicing module is configured to splice the SMART data and the performance data to obtain target hard disk data;
  • the input module is configured to input the target hard disk data into a pre-trained fault prediction model to obtain the first fault prediction result output by the fault prediction model;
  • the secondary prediction module is configured to, when the first fault prediction result is that there is a risk of failure, make a secondary determination, based on the SMART data and performance data of the hard disk data to be processed, of whether the hard disk data to be processed presents a risk of failure, and obtain the second fault prediction result;
  • the first determination module is configured to determine the hard disk failure risk level according to the second failure prediction result.
  • a computer-readable storage medium is also provided, and a computer program is stored in the storage medium, wherein the computer program is configured to execute the steps in any of the above method embodiments when running.
  • an electronic device is also provided, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
  • Figure 1 is a hardware structure block diagram of a mobile terminal of a hard disk failure prediction method according to an embodiment of the present disclosure
  • Figure 2 is a flow chart of a hard disk failure prediction method according to an embodiment of the present disclosure
  • Figure 3 is a flow chart of a hard disk failure prediction method according to an optional embodiment of the present disclosure
  • Figure 4 is a flow chart of hard disk data preprocessing according to this embodiment
  • Figure 5 is a flow chart of hard disk data marking according to this embodiment.
  • Figure 6 is a flow chart of data cleaning according to this embodiment.
  • Figure 7 is a flow chart of hard disk failure prediction and risk hard disk processing according to this embodiment.
  • Figure 8 is a flow chart for determining risky hard drives through an expert system according to this embodiment.
  • Figure 9 is a flow chart of faulty hard disk processing according to this embodiment.
  • Figure 10 is a flow chart of determining the automatic backup cycle of a hard disk according to this embodiment.
  • Figure 11 is a flow chart of hard disk data backup according to this embodiment.
  • Figure 12 is a block diagram of a hard disk failure prediction device according to an embodiment of the present disclosure.
  • Figure 13 is a block diagram of a hard disk failure prediction device according to an optional embodiment of the present disclosure.
  • FIG. 1 is a hardware structure block diagram of a mobile terminal of the hard disk failure prediction method according to an embodiment of the present disclosure.
  • the mobile terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include but is not limited to a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, wherein the above-mentioned mobile terminal may also include a transmission device 106 for communication functions and an input and output device 108.
  • the structure shown in Figure 1 is only illustrative, and it does not limit the structure of the above-mentioned mobile terminal.
  • the mobile terminal may also include more or fewer components than shown in FIG. 1 , or have a different configuration than shown in FIG. 1 .
  • the memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the hard disk failure prediction method in the embodiment of the present disclosure.
  • the processor 102 runs the computer program stored in the memory 104 to execute various functional applications and hard disk failure prediction processing, that is, to implement the above method.
  • Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include memory located remotely relative to the processor 102, and these remote memories may be connected to the mobile terminal through a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • the transmission device 106 is used to receive or send data via a network.
  • Specific examples of the above-mentioned network may include a wireless network provided by a communication provider of the mobile terminal.
  • the transmission device 106 includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 106 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet wirelessly.
  • FIG. 2 is a flow chart of the hard disk failure prediction method according to an embodiment of the present disclosure. As shown in Figure 2, the process includes the following steps:
  • Step S202 collect hard disk data to be processed, where the hard disk data includes hard disk SMART data and performance data;
  • Step S204 splice SMART data and performance data to obtain target hard disk data
  • Step S206 input the target hard disk data into the pre-trained fault prediction model, and obtain the fault prediction result output by the fault prediction model;
  • Step S208 When the fault prediction result is that there is a risk of failure, perform a secondary fault prediction on the hard disk data to be processed to obtain a second fault prediction result;
  • step S208: it is determined, based on the SMART data of the hard disk data to be processed, whether there is a hard disk risk; if the judgment result is yes, the second fault prediction result is determined to be that there is a hard disk risk; if the judgment result is no, it is determined, based on the performance data, whether there is a hard disk risk; if that judgment result is yes, the second fault prediction result is determined to be that there is a hard disk risk; if it is no, the second fault prediction result is determined to be that there is no hard disk risk.
  • the above SMART data at least includes: SMART5 (reallocated sector count), SMART187 (number of uncorrectable errors), SMART188 (number of command timeouts), SMART197 (current pending sector count) and SMART198 (offline uncorrectable sector count). If the original value of SMART5 is greater than the first preset value (for example, 500), the second fault prediction result is determined to be that there is a hard disk risk; if the original value of SMART187 is greater than the second preset value (for example, 100), the second fault prediction result is determined to be that there is a hard disk risk; if the original value of SMART188 is greater than the third preset value (for example, 100), the second fault prediction result is determined to be that there is a hard disk risk; if the original value of SMART197 or SMART198 is greater than the fourth preset value (for example, 10), the second fault prediction result is determined to be that there is a hard disk risk;
  • if the determination result for one SMART attribute is yes, the determination ends. If it is no, one of the remaining SMART attributes is selected, following the determination order, and the determination continues; if that result is yes, the determination ends, otherwise another of the remaining SMART attributes is selected, and so on until all SMART attributes have been judged and the final result is obtained. The details are not repeated here.
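The rule cascade described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the dictionary keys such as `smart_5`, and the argument `avg_read_files_per_sec` are hypothetical, and the thresholds are the example preset values quoted in the text. The performance rule (average number of successfully read files per second greater than 50) uses the example given for the fifth preset value elsewhere in this document.

```python
# Example preset values from the text; in practice these would be configurable.
# Each entry: (attributes checked together, threshold on the original value).
SMART_RULES = [
    (("smart_5",), 500),                # first preset value
    (("smart_187",), 100),              # second preset value
    (("smart_188",), 100),              # third preset value
    (("smart_197", "smart_198"), 10),   # fourth preset value
]
READ_FILES_PER_SEC_LIMIT = 50           # fifth preset value (performance rule)


def secondary_prediction(smart_raw, avg_read_files_per_sec):
    """Return True if the rule set flags the disk as at risk.

    smart_raw maps hypothetical attribute names to original values;
    missing attributes are treated as 0 (an assumption of this sketch).
    """
    # SMART rules are checked first; the first rule that fires ends the check.
    for attributes, limit in SMART_RULES:
        if any(smart_raw.get(name, 0) > limit for name in attributes):
            return True
    # Only when no SMART rule fires is the performance rule consulted.
    return avg_read_files_per_sec > READ_FILES_PER_SEC_LIMIT
```

Because every rule is an explicit threshold on a named attribute, the rule that fired can be reported to the user, which is one way to provide the explanation the disclosure says a pure AI prediction lacks.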
  • Step S210 Determine the hard disk failure risk level based on the second failure prediction result.
  • the above-mentioned step S208 may specifically include: when the fault prediction result is abnormal (that is, there is a risk of failure), making a secondary determination, based on the SMART data and performance data of the hard disk data to be processed, of whether the hard disk data to be processed presents a risk of failure, and obtaining the second fault prediction result; if the second fault prediction result is that there is a risk of failure, determining the hard disk failure risk level to be the first level; if the second fault prediction result is that there is no risk of failure, determining the hard disk failure risk level to be the second level, where the risk level of the first level is higher than that of the second level.
  • step S204: feature extraction is performed on the hard disk data to be processed to obtain the feature data to be processed; the feature data to be processed is filtered to obtain the target feature data to be processed; and it is determined that the target feature data to be processed meets the preset requirements. Specifically, it is determined whether the data time length of the target feature data to be processed is greater than or equal to the preset time length and whether the number of sampling points is greater than or equal to the preset value; if the judgment result is yes, the target feature data to be processed is determined to meet the preset requirements.
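The check on data time length and number of sampling points could look like the sketch below. The function name and parameters are hypothetical, and the sketch assumes that "data time length" means the span between the earliest and latest sample timestamps, which the text does not spell out.

```python
from datetime import datetime, timedelta


def meets_preset_requirements(timestamps, min_duration, min_points):
    """Check that the samples cover enough time and contain enough points.

    timestamps: list of datetime objects, one per sampling point.
    min_duration: preset time length (timedelta).
    min_points: preset number of sampling points.
    """
    if not timestamps:
        return False
    # Assumption: the data time length is the span from first to last sample.
    span = max(timestamps) - min(timestamps)
    return span >= min_duration and len(timestamps) >= min_points


# Usage: seven daily samples span six days.
samples = [datetime(2023, 1, 1) + timedelta(days=i) for i in range(7)]
enough = meets_preset_requirements(samples, timedelta(days=6), 7)
```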
  • step S208: when the hard disk failure risk level is the first level, a prompt is issued to replace the hard disk corresponding to the hard disk data to be processed; when the hard disk failure risk level is the second level, the hard disk data to be processed is backed up.
  • the RAID configuration of the hard disk corresponding to the hard disk data to be processed is obtained; if the RAID configuration is RAID0, the hard disk data is backed up every day; if the RAID configuration is RAID5, the hard disk data is backed up every week; if the RAID configuration is RAID1 or a RAID with a level greater than RAID1, the hard disk data is backed up according to a preset time period; if the RAID configuration is another type of RAID, the hard disk data is backed up every week.
  • the other types of RAID are RAIDs other than RAID0, RAID5, RAID1 and RAID with a level greater than RAID1.
  • Risk hard drives predicted by AI are divided into two categories according to the expert system.
  • If the expert system determines that the hard drive is risky, the hard drive is replaced directly. If the expert system does not determine that the hard drive is risky, the data of the risky drive is automatically backed up. That is, for high-risk disks, the hard disk is directly replaced, and for medium- and low-risk disks, data is automatically backed up.
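The two-stage decision described above (AI model first, expert system second, then an action per risk level) can be summarized in a short sketch. The enum, function names and action strings are hypothetical and only illustrate the mapping stated in the text.

```python
from enum import Enum
from typing import Optional


class RiskLevel(Enum):
    FIRST = 1   # higher risk: model and expert system both flag the disk
    SECOND = 2  # lower risk: only the model flags the disk


def risk_level(model_flags_risk: bool, expert_flags_risk: bool) -> Optional[RiskLevel]:
    """Combine the first (model) and second (expert system) predictions."""
    if not model_flags_risk:
        # The secondary prediction is only triggered for disks the model flags.
        return None
    return RiskLevel.FIRST if expert_flags_risk else RiskLevel.SECOND


def recommended_action(level: Optional[RiskLevel]) -> str:
    """Map a risk level to the handling described in the text."""
    if level is RiskLevel.FIRST:
        return "prompt hard disk replacement"
    if level is RiskLevel.SECOND:
        return "back up hard disk data"
    return "no action"
```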
  • Figure 3 is a flow chart of a hard disk failure prediction method according to an optional embodiment of the present disclosure. As shown in Figure 3, before the above step S202, the process further includes the following steps:
  • Step S302 collect a preset amount of hard disk data, where the hard disk data includes SMART data and performance data;
  • Step S304 Splice a preset number of SMART data and performance data to form a training data set
  • Step S306 Train the fault prediction model according to the training data set to obtain the trained fault prediction model.
  • the training of the fault prediction model can be completed, so that fault prediction can be performed on the hard disk data to be processed based on the trained fault prediction model.
  • step S304 the invalid data and noise data in the training data set are cleaned, and missing data are filled in. Delete negative samples of faulty disks in the training data to reduce data noise.
  • step S304: feature extraction is performed on the preset number of hard disk data to obtain the corresponding feature data; the feature data is filtered to obtain the target feature data corresponding to the preset number of hard disk data; and it is determined that the target feature data meets the preset requirements. Specifically, it is determined whether the data time length of the target feature data corresponding to the preset number of hard disk data is greater than or equal to the preset time length and whether the number of sampling points is greater than or equal to the preset value; if the judgment result is yes, the target feature data corresponding to the preset number of hard disk data is determined to meet the preset requirements.
  • step S306 may specifically include:
  • the label can be set in the following way: compare the hard disk failure time and the data collection time; when the interval between the data collection time and the hard disk failure time is less than N days, set the label to 1; when the interval is greater than N days and less than M days, set the label to 1 if at least one of the multiple attribute fields of the SMART data has an original value greater than 0, and set the label to 0 if the original values of those attribute fields are all equal to 0; when the interval is greater than M days, set the label to 0, where 1 represents risk and 0 represents no risk.
  • expert experience is combined to process the data in segments, making the data labels more consistent with the real status of the hard drive and with better interpretability.
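The segmented labeling rule above can be illustrated with the following sketch. The function and key names are hypothetical. The text does not say how intervals of exactly N or exactly M days are handled, so this sketch places an interval of N days in the middle band and an interval of M days in the "no risk" band; it also assumes that samples are collected on or before the failure date, and that disks with no recorded failure are labeled 0.

```python
from datetime import date
from typing import Optional

# Attribute fields checked in the middle band (hypothetical key names).
WATCHED = ("smart_5", "smart_187", "smart_188", "smart_197", "smart_198")


def label_sample(collected: date, failed: Optional[date],
                 smart_raw: dict, n_days: int, m_days: int) -> int:
    """Return 1 (risk) or 0 (no risk) for one daily sample."""
    if failed is None:
        return 0  # disk never failed
    gap = (failed - collected).days
    if gap < n_days:
        return 1  # close to the failure: always labeled as risk
    if gap < m_days:
        # Middle band: labeled as risk only if a watched attribute is non-zero.
        return 1 if any(smart_raw.get(k, 0) > 0 for k in WATCHED) else 0
    return 0  # far from the failure


# Usage with N = 2 and M = 5: three days before failure, SMART5 non-zero.
label = label_sample(date(2023, 1, 7), date(2023, 1, 10), {"smart_5": 3}, 2, 5)
```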
  • the loss function L_fl adopted in this embodiment is the Focal Loss, in which α is the balance factor, γ is the modulation parameter, y′ is the predicted value, and y is the true value of the sample.
  • FIG. 4 is a flow chart of hard disk data preprocessing according to this embodiment. As shown in Figure 4, the data processing method includes the following steps:
  • Step S401 collect hard disk data
  • SMART data: the smartctl tool is used to collect the SMART data of the hard disk once a day, at 3 a.m.
  • Hard drive performance data: in-band tools collect hard drive performance data once an hour.
  • Step S402 perform feature extraction on hard disk data
  • Step S403 verify the hard disk data
  • Step S404 Splice SMART data and performance data to form a training data set.
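The text does not specify how daily SMART data and hourly performance data are spliced. One plausible approach, shown here purely as an assumption, is to average the hourly performance metrics per disk per day and join them to that day's SMART record. The function name, the `serial` and `day` keys, and the `_mean` suffix are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def splice(smart_rows, perf_rows):
    """Join daily SMART rows with the daily mean of hourly performance rows.

    Every row is a dict with 'serial' and 'day' keys; all other keys are
    numeric fields. SMART rows without performance data for the same
    disk and day are dropped (a choice made for this sketch).
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for row in perf_rows:
        key = (row["serial"], row["day"])
        for name, value in row.items():
            if name not in ("serial", "day"):
                buckets[key][name].append(value)

    spliced = []
    for row in smart_rows:
        key = (row["serial"], row["day"])
        if key not in buckets:
            continue
        merged = dict(row)
        for name, values in buckets[key].items():
            merged[name + "_mean"] = mean(values)
        spliced.append(merged)
    return spliced
```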
  • Step S405 set labels for the hard disk data in the training data set
  • Figure 5 is a flow chart of hard disk data marking according to this embodiment. As shown in Figure 5, it includes the following steps:
  • Step S501 record the failure time of the failed hard disk
  • Step S502 determine whether the hard disk has failed;
  • if the hard disk has not failed, each piece of its data is marked as 0; otherwise, go to step S503.
  • Step S503 For the failed hard disk, determine the difference in days between the data collection time and the failure time;
  • Step S504 For data with a day difference of 2 to 5 days, check the SMART attributes of the data, and set a label for the hard disk data according to the SMART attributes;
  • if any of SMART5, SMART187, SMART188, SMART197, and SMART198 has a value greater than 0, the data is marked as 1; otherwise it is marked as 0.
  • Step S406 clean the hard disk data in the training data set
  • Figure 6 is a flow chart of data cleaning according to this embodiment. As shown in Figure 6, it includes the following steps:
  • Step S601 Check each data label of the faulty disk in the data set. Specifically, traverse the daily data of the faulty disk in the data set and check the tags.
  • Step S602 determine whether the label is 0;
  • the samples of the faulty disk with the label 0 are noise data.
  • Step S603 Delete noise data, that is, delete data with a label of 0.
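The cleaning step amounts to a filter over the labeled samples. A minimal sketch, assuming each sample is a dict with hypothetical `disk_failed` and `label` keys:

```python
def drop_noise(samples):
    """Remove samples of failed disks that are labeled 0 (treated as noise).

    Samples from healthy disks are kept regardless of label, as are
    samples of failed disks labeled 1.
    """
    return [s for s in samples
            if not (s["disk_failed"] and s["label"] == 0)]


# Usage: only the label-0 sample of the failed disk is removed.
data = [
    {"serial": "A", "disk_failed": True, "label": 0},
    {"serial": "A", "disk_failed": True, "label": 1},
    {"serial": "B", "disk_failed": False, "label": 0},
]
cleaned = drop_noise(data)
```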
  • Figure 7 is a flow chart of hard disk failure prediction and risk hard disk processing according to this embodiment. As shown in Figure 7, the data processing method includes the following steps:
  • Step S701 collect hard disk data
  • the training data needs to record the hard disk failure time and hard disk serial number for marking.
  • Step S702 data processing
  • for training data, the above method can be used for data processing; for test data, the above method can also be used, but without the two steps of data marking and data noise reduction.
  • Step S703 train the fault prediction model
  • the objective function of the algorithm uses the Focal Loss function.
  • in the Focal Loss function, α is the balance factor, γ is the modulation parameter, y′ is the predicted value, and y is the true value of the sample.
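The formula itself is not reproduced in this text. For reference, the commonly used binary form of the Focal Loss, which matches the symbol definitions given here, is

L_fl = -α (1 - y′)^γ log(y′) when y = 1, and L_fl = -(1 - α) (y′)^γ log(1 - y′) when y = 0.

Whether the embodiment uses exactly this form is an assumption. A minimal per-sample sketch in Python follows; the default values of `alpha` and `gamma` and the clipping constant `eps` are illustrative choices, not values from the source.

```python
import math


def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sample (standard form, assumed here).

    y_true: 1 for a risky (positive) sample, 0 otherwise.
    y_pred: predicted probability of the positive class.
    """
    # Clip the prediction so the logarithm is always defined.
    p = min(max(y_pred, eps), 1.0 - eps)
    if y_true == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
```

The factor raised to the power γ shrinks the loss of samples the model already classifies confidently, so training concentrates on hard samples; α rebalances the positive and negative classes, which is useful when failed disks are rare compared with healthy ones.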
  • Step S704 evaluate the trained fault prediction model
  • n_pp refers to the number of hard disks predicted to fail in the next 30 days within the evaluation window
  • n_tpp refers to the number of faulty memories discovered 30 days in advance within the evaluation window
  • n_tr refers to the number of all hard disk failures within the evaluation window
  • n_tpr refers to the number of failed hard disks discovered 30 days in advance within the evaluation window.
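The evaluation formulas are not reproduced in this text. Given the four counts defined above, a natural reading is precision = n_tpp / n_pp and recall = n_tpr / n_tr, optionally combined into an F1 score; this is an assumption, not something the source states. A small sketch under that assumption, with a hypothetical function name:

```python
def evaluate(n_pp, n_tpp, n_tr, n_tpr):
    """Return (precision, recall, f1) from the four evaluation counts.

    Assumed definitions: precision = n_tpp / n_pp, recall = n_tpr / n_tr.
    A zero denominator yields 0.0 for the corresponding metric.
    """
    precision = n_tpp / n_pp if n_pp else 0.0
    recall = n_tpr / n_tr if n_tr else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```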
  • Step S705 perform inference on the hard disk data based on the trained fault prediction model
  • Step S706 perform a secondary prediction on the hard disk data through the expert system
  • Figure 8 is a flow chart for determining risky hard drives through the expert system according to this embodiment.
  • the expert system consists of two sets of rules. One type is SMART rule set, the other type is performance data rule set, including the following steps:
  • Step S801 check the SMART data and performance data of the hard disk
  • Step S802 determine whether the original value of SMART5 is greater than 500 (an example of the above-mentioned first preset value). If the determination result is no, execute step S803. If the determination result is yes, execute step S807;
  • Step S803 determine whether the original value of SMART187 is greater than 100 (an example of the above-mentioned second preset value). If the determination result is no, execute step S804. If the determination result is yes, execute step S807;
  • Step S804 determine whether the original value of SMART188 is greater than 100 (an example of the above-mentioned third preset value). If the determination result is no, execute step S805. If the determination result is yes, execute step S807;
  • Step S805 Determine whether the original value of SMART197 and the original value of SMART198 are greater than 10 (an example of the above-mentioned fourth preset value). If the judgment result is no, execute step S806. If the judgment result is yes, execute step S807;
  • Step S806 Determine whether the average number of successfully read files per second from the hard disk is greater than 50 (an example of the above-mentioned fifth preset value). If the judgment result is no, execute step S807. If the judgment result is yes, execute step S808;
  • Step S807 Confirm that the hard disk is normal.
  • Step S808 Processing of risky hard disks. Specifically, for hard disks that are judged as risky by both the expert system and AI, it is recommended that the user replace the hard drive.
  • Figure 9 is a flow chart of faulty hard disk processing according to this embodiment. As shown in Figure 9, it includes the following steps:
  • Step S901 Obtain the first fault prediction result obtained by inferring the hard disk data using the fault prediction model
  • Step S902 determine whether the first fault prediction result indicates that the hard disk is normal. If the determination result is no, execute step S903;
  • If the inference result is normal, the hard disk is considered to be in a healthy state; otherwise, go to step S903.
  • Step S903 obtain the second fault prediction result obtained by the expert system's secondary prediction
  • Step S904 Determine the hard disk risk level according to the second fault prediction result, and process the risky hard disk according to the hard disk risk level.
  • If the second fault prediction result obtained by the expert system indicates that the hard disk is at risk, the hard disk risk level is high (corresponding to the first level above) and the hard disk is replaced. If the second fault prediction result obtained by the expert system indicates that the hard disk is not at risk, the hard disk risk level is low (corresponding to the second level above) and the hard disk data is automatically backed up at regular intervals.
  • FIG. 10 is a flow chart of automatic hard disk backup cycle determination according to this embodiment, as shown in Figure 10 , including the following steps:
  • Step S1001 check the RAID (Redundant Arrays of Independent Disks) configuration of the risky hard disk;
  • Step S1002 if the RAID configuration is RAID0, automatically back up data every early morning, otherwise go to step S1003.
  • Step S1003 if the RAID configuration is RAID5, automatically back up data every week, otherwise, go to step S1004.
  • Step S1004 if the RAID is configured as RAID1 or higher level RAID, backup is not performed by default, but the user can configure the automatic backup cycle on the mobile phone. Otherwise, go to step S1005.
  • Step S1005 If the RAID is configured as other types of RAID, data is automatically backed up every week.
  • the backup cycle in this embodiment is only to illustrate the solution steps. In actual implementation, the backup cycle can be adjusted according to specific conditions.
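The backup-cycle decision of steps S1001 to S1005 can be sketched as a small lookup. The function name is hypothetical, and so is the set of levels treated as "RAID1 or higher level RAID": the text does not enumerate them, so `HIGHER_REDUNDANCY` below is an illustrative assumption. The intervals are the example cycles from the text and can be adjusted.

```python
from typing import Optional

# Assumption: which RAID levels count as "RAID1 or higher level RAID".
HIGHER_REDUNDANCY = {"RAID1", "RAID6", "RAID10"}


def backup_interval_days(raid_level: str,
                         user_interval_days: Optional[int] = None) -> Optional[int]:
    """Return the automatic backup interval in days, or None for no backup."""
    level = raid_level.upper()
    if level == "RAID0":
        return 1   # no redundancy: back up every day
    if level == "RAID5":
        return 7   # back up every week
    if level in HIGHER_REDUNDANCY:
        # Not backed up by default; the user may configure a cycle.
        return user_interval_days
    return 7       # other RAID types: back up every week
```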
  • Figure 11 is a flow chart of hard disk data backup according to this embodiment. As shown in Figure 11, it includes the following steps:
  • Step S1101 perform data compression on the hard disk data that needs to be backed up;
  • Step S1102 determine whether the local machine has a spare hard disk. If the determination result is yes, execute step S1103. If the determination result is no, execute step S1104;
  • Step S1103 back up the compressed data to other spare hard disks in the data center;
  • Step S1104 back up the compressed data to the spare hard disk.
  • a hard disk failure prediction device is also provided.
  • Figure 12 is a block diagram of the hard disk failure prediction device according to the embodiment of the present disclosure. As shown in Figure 12, the device includes:
  • the first collection module 122 is configured to collect hard disk data to be processed, where the hard disk data includes SMART data and performance data;
  • the first splicing module 124 is configured to splice the SMART data and the performance data to obtain the target hard disk data;
  • the input module 126 is configured to input the target hard disk data into a pre-trained fault prediction model to obtain the first fault prediction result output by the fault prediction model;
  • the secondary prediction module 128 is configured to, when the first fault prediction result is that there is a risk of failure, make a secondary determination, based on the SMART data and performance data of the hard disk data to be processed, of whether the hard disk data to be processed presents a risk of failure, and obtain the second fault prediction result;
  • the first determination module 1210 is configured to determine the hard disk failure risk level according to the second failure prediction result.
  • the device further includes:
  • the first feature extraction module is configured to perform feature extraction on the hard disk data to be processed to obtain feature data to be processed;
  • the first filtering module is configured to filter the feature data to be processed and obtain the target feature data to be processed
  • the second determination module is configured to determine that the target characteristic data to be processed meets the preset requirements.
  • the first determination module is further configured to determine that the hard disk failure risk level is the first level if the second fault prediction result is that there is a risk of failure, and to determine that the hard disk failure risk level is the second level if the second fault prediction result is that there is no risk of failure, where the risk level of the first level is higher than the risk level of the second level.
  • the device further includes:
  • a prompt module configured to prompt the replacement of the hard disk corresponding to the hard disk data to be processed when the hard disk failure risk level is the first level
  • the backup module is configured to back up the hard disk data to be processed when the hard disk failure risk level is the second level.
  • the backup module is also configured to obtain the RAID configuration of the hard disk corresponding to the hard disk data to be processed; if the RAID configuration is RAID0, the hard disk data is backed up every day; if the RAID configuration is RAID5, the hard disk data is backed up every week; if the RAID configuration is RAID1 or a RAID with a level greater than RAID1, the hard disk data is backed up according to a preset time period; if the RAID configuration is another type of RAID, the hard disk data is backed up every week, wherein the other types of RAID are RAIDs other than the RAID0, the RAID5, the RAID1 and the RAID with a level greater than the RAID1.
  • the secondary prediction module 128 is also configured to determine whether the second fault prediction result indicates that there is a hard disk risk based on the SMART data of the hard disk data to be processed; if the judgment result is yes, determine The second fault prediction result is that there is a hard disk risk; if the judgment result is no, determine whether the second fault prediction result is that the hard disk risk is present according to the performance data; if the judgment result is yes, determine The second fault prediction result is that there is a hard disk risk; if the judgment result is no, it is determined that the second fault prediction result is that there is no hard disk risk.
  • the secondary prediction module 128 is also configured such that the SMART data includes: SMART5, SMART187, SMART188, SMART197 and SMART198, and determines whether the original value of SMART5 is greater than the first preset value. If the result is yes, it is determined that the second fault prediction result is that there is a hard disk risk; if the judgment result is no, it is judged whether the original value of SMART187 is greater than the second preset value.
  • the judgment result is yes, In this case, it is determined that the second fault prediction result is that there is a hard disk risk; if the judgment result is no, it is judged whether the original value of SMART188 is greater than the third preset value; if the judgment result is yes, it is determined The second fault prediction result is that there is a hard disk risk; if the judgment result is no, it is judged whether the original value of SMART197 or SMART198 is greater than the fourth preset value, If the judgment result is yes, it is determined that the second fault prediction result is that there is a hard disk risk; if the judgment result is no, it is judged whether the average number of successfully read files per second of the hard disk in the performance data is greater than the fifth Default value; if the judgment result is yes, it is determined that the second fault prediction result is that there is a hard disk risk; if the judgment result is no, it is determined that the second fault prediction result is that there is no hard disk risk.
  • the device further includes:
  • the second collection module is configured to collect a preset amount of hard disk data, where the hard disk data includes SMART data and performance data;
  • the second splicing module is configured to splice a preset number of the SMART data and the performance data to form a training data set;
  • a training module is configured to train a fault prediction model according to the training data set to obtain the trained fault prediction model.
  • the device further includes:
  • the cleaning module is configured to clean invalid data and noise data in the training data set and fill in missing data.
  • the device further includes:
  • the second feature extraction module is configured to perform feature extraction on the preset number of hard disk data, and obtain feature data corresponding to the preset number of hard disk data;
  • the second filtering module is configured to filter the characteristic data corresponding to the preset number of hard disk data to obtain the target characteristic data corresponding to the preset number of hard disk data;
  • the third determination module is configured to determine that the target characteristic data corresponding to the preset quantity of hard disk data meets the preset requirements.
  • the training module includes:
  • the training submodule is configured to train the fault prediction model according to the training data set, and obtain the trained fault prediction model when the loss function meets the preset conditions.
  • the loss function L_fl is the Focal Loss, which in its usual binary form is $L_{fl} = -\alpha\,y\,(1-y')^{\gamma}\log y' - (1-\alpha)(1-y)\,y'^{\gamma}\log(1-y')$, where α is the balance factor, γ is the modulation parameter, y′ is the predicted value, and y is the true value of the sample.
  • the setting sub-module is also configured to compare the hard disk failure time with the data collection time; when the interval between the data collection time and the hard disk failure time is less than N days, the label is set to 1; when the interval is greater than N days and less than M days and the raw value of one of the multiple attribute fields of the SMART data is greater than 0, the label is set to 1; when the interval is greater than N days and less than M days and the raw values of the multiple attribute fields of the SMART data are all equal to 0, the label is set to 0; when the interval is greater than M days, the label is set to 0, where 1 represents risk and 0 represents no risk.
  • Figure 13 is a block diagram of a hard disk failure prediction device according to an optional embodiment of the present disclosure. As shown in Figure 13, it includes:
  • the data collection module 132 is configured to realize the functions of the first collection module 122 and the second collection module. It is mainly responsible for collecting hard disk data at fixed time intervals, mainly of two types: SMART data and hard disk performance data collected inside the operating system.
  • the hard disk SMART data includes the raw value and the current value; the main SMART attribute fields include SMART5, SMART187, SMART188, SMART197, SMART198, etc.;
  • the hard disk performance data collected inside the operating system includes disk-level performance indicators, such as throughput and average wait time for I/O operations, and server-level performance indicators, such as CPU activity and page-in and page-out activity;
  • the feature extraction module 134 is configured to implement the function of the first feature extraction module. It is mainly responsible for feature extraction of the collected data, filtering out data columns not used by the detection prediction algorithm, and retaining only data columns that are subsequently useful.
  • the data verification module 136 is mainly responsible for verifying whether the time span of the data and the number of sampling points meet the minimum data volume required for fault detection and prediction: data is collected at least once a day, over at least two days.
  • the data combination module 138 is mainly responsible for splicing SMART data and performance data to form a training data set.
  • the label calculation module 1310 is mainly responsible for calculating the label of each piece of data in the training data set based on the hard disk failure time. The specific method is to compare the hard disk failure time with the data collection time. If the interval between the collection time and the failure time is less than N days, the data is labeled 1, which represents risk. If the interval is greater than N days and less than M days and, in the collected data, the raw value of at least one of the five preferred attributes SMART5, SMART187, SMART188, SMART197 and SMART198 is greater than 0, the data is also labeled 1. If the interval is greater than N days and less than M days and the raw values of these five attributes are all equal to 0, the data is labeled 0, which represents healthy. If the interval is greater than M days, the data is also labeled 0.
  • the data cleaning module 1312 is configured to implement the functions of the above-mentioned cleaning module. It is mainly responsible for cleaning invalid data and noise data in the data set and filling in missing data. For faulty hard disks in the training set, delete data records with a label of 0 to reduce noise in the data set.
  • the AI training module 1314 is configured to implement the functions of the above training module, and is mainly responsible for performing machine learning training on the training data set.
  • the training loss calculation module 1316 is configured to implement part of the functions of the above-mentioned training module. It is mainly responsible for calculating the loss of the sample so that the model can be trained in the direction of small loss. Specifically, the Focal Loss function is used as the loss function.
  • the inference module 1318 is configured to input unknown hard disk data into the trained model for inference.
  • the expert system module 1320 inputs data into the expert system for hard disks predicted to be risky by the AI model and determines the status of the hard disk again.
  • the risk disk processing module 1322 sets the risk level to high for hard disks that both the AI model and the expert system judge to be abnormal, and recommends that the user replace the hard disk. For hard disks in which the expert system finds no abnormality, the risk level is set to medium.
  • the hard disk data backup module 1324 is configured to implement the functions of the above backup module. For hard disks with a medium risk level, the hard disk data is compressed and automatically backed up to another hard disk at regular intervals. The other hard disk can be a hard disk of the same server or a dedicated backup disk in the data center. Each backup overwrites the previous one, and the automatic backup is preferably scheduled in the early morning, when the business volume is low.
  • Embodiments of the present disclosure also provide a computer-readable storage medium that stores a computer program, wherein the computer program is configured to execute the steps in any of the above method embodiments when running.
  • the computer-readable storage medium may include but is not limited to: U disk, read-only memory (Read-Only Memory, referred to as ROM), random access memory (Random Access Memory, referred to as RAM) , mobile hard disk, magnetic disk or optical disk and other media that can store computer programs.
  • Embodiments of the present disclosure also provide an electronic device, including a memory and a processor.
  • a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
  • the above-mentioned electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the above-mentioned processor, and the input-output device is connected to the above-mentioned processor.
  • Obviously, those skilled in the art should understand that the above modules or steps of the present disclosure can be implemented using general-purpose computing devices, and they can be concentrated on a single computing device or distributed across a network composed of multiple computing devices. They may be implemented in program code executable by a computing device, such that they may be stored in a storage device for execution by the computing device, and in some cases the steps shown or described may be executed in a sequence different from that given here. Alternatively, they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

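Provided are a hard disk failure prediction method and device, a storage medium, and an electronic device. The method includes: collecting hard disk data to be processed, and splicing the SMART data and the performance data of the hard disk data to obtain target hard disk data; inputting the target hard disk data into a fault prediction model to obtain a first fault prediction result; when the first fault prediction result indicates a risk of failure, determining a second time, based on the SMART data and the performance data of the hard disk data to be processed, whether a risk of failure exists, to obtain a second fault prediction result; and determining a hard disk failure risk level according to the second fault prediction result. This can solve the problems in the related art that the no-trouble-found (NTF) rate is high when AI-based hard disk failure prediction is not accurate enough, and that it is often impossible to explain to users why a risky disk is in a risky state, which leads users to question and distrust the hard disk failure prediction system. It also improves the interpretability of the prediction results.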

Description

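Hard disk failure prediction method and device, storage medium, and electronic device
Cross-reference to related applications
The present disclosure is based on, and claims priority to, Chinese patent application CN202210649272.9, filed on June 9, 2022 and entitled "Hard disk failure prediction method and device, storage medium, and electronic device", the entire disclosure of which is incorporated herein by reference.
Technical field
Embodiments of the present disclosure relate to the field of communications, and in particular to a hard disk failure prediction method and device, a storage medium, and an electronic device.
Background
In large-scale data centers, hard disk failures frequently degrade the stability and reliability of IT infrastructure. Since the beginning of the 21st century, academia has carried out extensive research on hard disk failure prediction and diagnosis, combining big data with artificial intelligence methods and achieving high accuracy. In concrete industrial scenarios, however, hard disk failure prediction faces many problems, such as complex environments and workloads, heavy noise, and sudden device death, so prediction accuracy and the generalization ability of the algorithms still fall short of satisfactory results in industrial applications.
Both academia and industry have done considerable research on predicting hard disk failures. Academic work generally targets a single hard disk model, and the precision for the failure class can generally reach about 90%. Industry has to handle complex environments with multiple hard disk vendors and multiple hard disk models, and the overall precision for the failure class is currently only about 80%. A wrong prediction raises the proportion of no-trouble-found (NTF) returns and hurts the quality metrics of the data center. Wrong predictions also cause hardware to be replaced by mistake, which brings economic losses to the data center. In addition, when artificial intelligence methods are used to predict hard disk failures, it is often impossible to explain to users why a risky disk is in a risky state, which leads users to question and distrust the hard disk failure prediction system.
No solution has yet been proposed for the problems in the related art that NTF is high when AI-based hard disk failure prediction is not accurate enough, and that it is often impossible to explain to users why a risky disk is in a risky state, which leads users to question and distrust the hard disk failure prediction system.
Summary
Embodiments of the present disclosure provide a hard disk failure prediction method and device, a storage medium, and an electronic device, to at least solve the problems in the related art that NTF is high when AI-based hard disk failure prediction is not accurate enough, and that it is often impossible to explain to users why a risky disk is in a risky state, which leads users to question and distrust the hard disk failure prediction system.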
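According to one embodiment of the present disclosure, a hard disk failure prediction method is provided. The method includes:
collecting hard disk data to be processed, where the hard disk data includes Self-Monitoring, Analysis and Reporting Technology (SMART) data and performance data;
splicing the SMART data and the performance data to obtain target hard disk data;
inputting the target hard disk data into a pre-trained fault prediction model to obtain a first fault prediction result output by the fault prediction model;
when the first fault prediction result indicates a risk of failure, determining a second time, based on the SMART data and the performance data of the hard disk data to be processed, whether a risk of failure exists, to obtain a second fault prediction result;
determining a hard disk failure risk level according to the second fault prediction result.
According to another embodiment of the present disclosure, a hard disk failure prediction device is also provided. The device includes:
a first collection module, configured to collect hard disk data to be processed, where the hard disk data includes SMART data and performance data;
a first splicing module, configured to splice the SMART data and the performance data to obtain target hard disk data;
an input module, configured to input the target hard disk data into a pre-trained fault prediction model to obtain a first fault prediction result output by the fault prediction model;
a secondary prediction module, configured to determine a second time, when the first fault prediction result indicates a risk of failure and based on the SMART data and the performance data of the hard disk data to be processed, whether a risk of failure exists, to obtain a second fault prediction result;
a first determination module, configured to determine a hard disk failure risk level according to the second fault prediction result.
According to yet another embodiment of the present disclosure, a computer-readable storage medium is also provided. A computer program is stored in the storage medium, and the computer program is configured to execute, when run, the steps in any of the above method embodiments.
According to yet another embodiment of the present disclosure, an electronic device is also provided, including a memory and a processor. A computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.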
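Brief description of the drawings
Figure 1 is a block diagram of the hardware structure of a mobile terminal for the hard disk failure prediction method according to an embodiment of the present disclosure;
Figure 2 is a flowchart of the hard disk failure prediction method according to an embodiment of the present disclosure;
Figure 3 is a flowchart of the hard disk failure prediction method according to an optional embodiment of the present disclosure;
Figure 4 is a flowchart of hard disk data preprocessing according to this embodiment;
Figure 5 is a flowchart of hard disk data labeling according to this embodiment;
Figure 6 is a flowchart of data cleaning according to this embodiment;
Figure 7 is a flowchart of hard disk failure prediction and risky-disk handling according to this embodiment;
Figure 8 is a flowchart of judging a risky hard disk through the expert system according to this embodiment;
Figure 9 is a flowchart of faulty-disk handling according to this embodiment;
Figure 10 is a flowchart of determining the automatic backup period of a hard disk according to this embodiment;
Figure 11 is a flowchart of hard disk data backup according to this embodiment;
Figure 12 is a block diagram of the hard disk failure prediction device according to an embodiment of the present disclosure;
Figure 13 is a block diagram of the hard disk failure prediction device according to an optional embodiment of the present disclosure.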
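Detailed description
Embodiments of the present disclosure are described in detail below with reference to the drawings and in combination with the embodiments.
It should be noted that the terms "first", "second", and so on in the description, the claims, and the above drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The method embodiments provided in the embodiments of the present disclosure may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, Figure 1 is a block diagram of the hardware structure of a mobile terminal for the hard disk failure prediction method according to an embodiment of the present disclosure. As shown in Figure 1, the mobile terminal may include one or more processors 102 (only one is shown in Figure 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. The mobile terminal may also include a transmission device 106 for communication functions and an input/output device 108. Those of ordinary skill in the art will understand that the structure shown in Figure 1 is only illustrative and does not limit the structure of the mobile terminal. For example, the mobile terminal may include more or fewer components than shown in Figure 1, or have a configuration different from that shown in Figure 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the hard disk failure prediction method in the embodiments of the present disclosure. The processor 102 runs the computer program stored in the memory 104 to execute various functional applications and hard disk failure prediction processing, that is, to implement the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.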
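This embodiment provides a hard disk failure prediction method that runs on the above mobile terminal or network architecture. Figure 2 is a flowchart of the hard disk failure prediction method according to an embodiment of the present disclosure. As shown in Figure 2, the process includes the following steps:
Step S202: collect hard disk data to be processed, where the hard disk data includes hard disk SMART data and performance data;
Step S204: splice the SMART data and the performance data to obtain target hard disk data;
Step S206: input the target hard disk data into the pre-trained fault prediction model to obtain the first fault prediction result output by the fault prediction model;
Step S208: when the first fault prediction result indicates a risk of failure, perform a second fault prediction on the hard disk data to be processed to obtain a second fault prediction result;
In this embodiment, in step S208 above, whether the second fault prediction result is that a hard disk risk exists is first judged from the SMART data of the hard disk data to be processed. If yes, the second fault prediction result is determined to be that a hard disk risk exists. If no, the same question is judged from the performance data: if yes, the second fault prediction result is determined to be that a hard disk risk exists; if no, the second fault prediction result is determined to be that no hard disk risk exists. Further, the SMART data includes at least SMART5 (reallocated sector count), SMART187 (uncorrectable error count), SMART188 (command timeout count), SMART197 (current pending sector count), and SMART198 (offline uncorrectable sector count). If the raw value of SMART5 is greater than a first preset value (for example, 500), the second fault prediction result is determined to be that a hard disk risk exists; if the raw value of SMART187 is greater than a second preset value (for example, 100), a hard disk risk exists; if the raw value of SMART188 is greater than a third preset value (for example, 100), a hard disk risk exists; if the raw value of SMART197 or SMART198 is greater than a fourth preset value (for example, 10), a hard disk risk exists; if, in the performance data, the average number of files successfully read per second by the hard disk is greater than a fifth preset value (for example, 50), a hard disk risk exists. If none of these conditions is met, that is, the raw value of SMART5 is not greater than the first preset value, the raw value of SMART187 is not greater than the second preset value, the raw value of SMART188 is not greater than the third preset value, the raw values of SMART197 and SMART198 are not greater than the fourth preset value, and the average number of files successfully read per second is not greater than the fifth preset value, the second fault prediction result is determined to be that no hard disk risk exists. Other SMART data may of course be used to judge whether the second fault prediction result is that a hard disk risk exists; the judgment is made in a similar way and is not repeated here. It should also be noted that this embodiment does not limit the order in which SMART5, SMART187, SMART188, SMART197, and SMART198 are judged. Any one of them may be judged first; if the result is yes, the judgment ends; if the result is no, another one is selected from the remaining SMART data and judged, and so on, until all the SMART data have been judged and the final result is obtained.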
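Step S210: determine the hard disk failure risk level according to the second fault prediction result.
In this embodiment, step S208 above may specifically include: when the first fault prediction result is abnormal, determining a second time, based on the SMART data and the performance data of the hard disk data to be processed, whether a risk of failure exists, to obtain the second fault prediction result. If the second fault prediction result is that a risk of failure exists, the hard disk failure risk level is determined to be the first level; if the second fault prediction result is that no risk of failure exists, the hard disk failure risk level is determined to be the second level. The risk level of the first level is higher than that of the second level.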
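The two-stage decision in steps S206 to S210 can be sketched as follows. This is only an illustration: the names `RiskLevel`, `assess_disk`, `model_predicts_risk`, and `second_check` are hypothetical, and the extra `NONE` value for disks that the model does not flag is an assumption added so that the function always returns something.

```python
from enum import Enum
from typing import Any, Callable


class RiskLevel(Enum):
    NONE = "none"      # assumption: model sees no risk, second stage is skipped
    FIRST = "first"    # model and second check both report risk (higher level)
    SECOND = "second"  # model reports risk, second check does not confirm it


def assess_disk(
    sample: Any,
    model_predicts_risk: Callable[[Any], bool],
    second_check: Callable[[Any], bool],
) -> RiskLevel:
    """Run the second check only for samples the model flags as risky."""
    if not model_predicts_risk(sample):
        return RiskLevel.NONE
    return RiskLevel.FIRST if second_check(sample) else RiskLevel.SECOND
```

Keeping the second check as a separate callable means the rule set can be changed without retraining or touching the model.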
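Steps S202 to S210 above can solve the problems in the related art that NTF is high when AI-based hard disk failure prediction is not accurate enough, and that it is often impossible to explain to users why a risky disk is in a risky state, which leads users to question and distrust the hard disk failure prediction system. Combining the expert system with the fault prediction model to predict hard disk failures raises the prediction precision, lowers NTF when the fault prediction model is not accurate enough, and improves the interpretability of the prediction results.
In one embodiment, before step S204 above, feature extraction is performed on the hard disk data to be processed to obtain feature data to be processed; the feature data to be processed is filtered to obtain target feature data to be processed; and it is determined that the target feature data to be processed meets preset requirements. Specifically, it is judged whether the time span of the target feature data to be processed is greater than or equal to a preset time span and whether the number of sampling points is greater than or equal to a preset value; if yes, the target feature data to be processed is determined to meet the preset requirements.
In another embodiment, after step S208 above, when the hard disk failure risk level is the first level, a prompt is given to replace the hard disk corresponding to the hard disk data to be processed; when the hard disk failure risk level is the second level, the hard disk data to be processed is backed up. Specifically, the RAID configuration of the hard disk corresponding to the hard disk data to be processed is obtained; if the RAID configuration is RAID0, the hard disk data is backed up every day; if the RAID configuration is RAID5, the hard disk data is backed up every week; if the RAID configuration is RAID1 or a RAID with a level higher than RAID1, the hard disk data is backed up according to a preset time period; if the RAID configuration is another type of RAID, the hard disk data is backed up every week, where the other types of RAID are RAIDs other than RAID0, RAID5, RAID1, and RAIDs with a level higher than RAID1. Risky hard disks predicted by the AI are divided into two classes by the expert system: if the expert system judges that the hard disk is at risk, the hard disk is replaced directly; if the expert system does not judge the hard disk to be at risk, the data on the risky disk is backed up automatically. In other words, high-risk disks are replaced directly, and medium- and low-risk disks have their data backed up automatically.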
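Figure 3 is a flowchart of the hard disk failure prediction method according to an optional embodiment of the present disclosure. As shown in Figure 3, before step S202 above, the process includes the following steps:
Step S302: collect a preset amount of hard disk data, where the hard disk data includes SMART data and performance data;
Step S304: splice the preset amount of SMART data and performance data to form a training data set;
Step S306: train the fault prediction model according to the training data set to obtain the trained fault prediction model.
Steps S302 to S306 above complete the training of the fault prediction model, so that the trained model can be used to predict failures from the hard disk data to be processed. Big data analysis methods are used to analyze the hard disk SMART data and performance data, a machine learning algorithm is used to train on the data, and the trained model is used to predict failures from hard disk data.
In one embodiment, before step S304 above, invalid data and noise data in the training data set are cleaned, and missing data is filled in. Negative samples of failed disks are deleted from the training data to reduce data noise.
In another embodiment, before step S304 above, feature extraction is performed on the preset amount of hard disk data to obtain the corresponding feature data; the feature data is filtered to obtain the corresponding target feature data; and it is determined that the target feature data meets preset requirements. Specifically, it is judged whether the time span of the target feature data is greater than or equal to a preset time span and whether the number of sampling points is greater than or equal to a preset value; if yes, the target feature data corresponding to the preset amount of hard disk data is determined to meet the preset requirements.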
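In this embodiment, step S306 above may specifically include:
S1: set the label of each piece of data in the training data set according to the hard disk failure time;
Further, in S1 above, the label may be set as follows: compare the hard disk failure time with the data collection time; when the interval between the data collection time and the hard disk failure time is less than N days, set the label to 1; when the interval is greater than N days and less than M days and the raw value of one of the multiple attribute fields of the SMART data is greater than 0, set the label to 1; when the interval is greater than N days and less than M days and the raw values of the multiple attribute fields of the SMART data are all equal to 0, set the label to 0; when the interval is greater than M days, set the label to 0, where 1 represents risk and 0 represents no risk. When the training data is labeled, expert experience is used to process the data in segments, so that the labels better match the true state of the hard disk and are easier to interpret.
S2: train the fault prediction model according to the training data set, and obtain the trained fault prediction model when the loss function meets a preset condition.
The loss function L_fl used in this embodiment is the Focal Loss, which in its usual binary form is $L_{fl} = -\alpha\,y\,(1-y')^{\gamma}\log y' - (1-\alpha)(1-y)\,y'^{\gamma}\log(1-y')$, where α is the balance factor, γ is the modulation parameter, y′ is the predicted value, and y is the true value of the sample. Using the Focal Loss function as the loss function addresses the severe imbalance between positive and negative samples in hard disk data.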
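Figure 4 is a flowchart of hard disk data preprocessing according to this embodiment. As shown in Figure 4, the data processing method includes the following steps:
Step S401: collect hard disk data;
Two types of data are collected: 1. SMART data, collected from the hard disk with the smartctl tool once a day, at 3 a.m.; 2. hard disk performance data, collected with in-band tools once an hour.
Step S402: perform feature extraction on the hard disk data;
The two types of collected data are analyzed with the Pearson correlation coefficient, and feature columns that are highly correlated with other columns, as well as feature columns whose values never change, are deleted;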
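The feature filtering described for step S402 can be sketched in plain Python as follows. The function names, the dictionary-of-columns layout, and the 0.95 correlation cutoff are assumptions chosen for illustration; the source does not state a cutoff.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(
    columns: Dict[str, Sequence[float]], max_abs_corr: float = 0.95
) -> List[str]:
    """Drop constant columns and columns highly correlated with one already kept."""
    kept: List[str] = []
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # the column never changes, so it carries no signal
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # redundant with a column that is already kept
        kept.append(name)
    return kept
```

Columns are visited in insertion order, so of two highly correlated columns the earlier one is the one retained.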
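Step S403: verify the hard disk data;
Verify whether the time span of the data and the number of sampling points meet the minimum data volume required for fault detection and prediction: data is collected once a day for three consecutive days, and on each collection at least one of the two types of data has been collected.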
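One way to read the check in step S403 is "there is a run of at least three consecutive days on which SMART data or performance data is present". The sketch below implements that reading; the function names and the use of sets of dates are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable


def longest_daily_run(days: Iterable[date]) -> int:
    """Length of the longest run of consecutive calendar days in `days`."""
    best = run = 0
    previous = None
    for day in sorted(set(days)):
        if previous is not None and day - previous == timedelta(days=1):
            run += 1
        else:
            run = 1
        best = max(best, run)
        previous = day
    return best


def has_enough_data(
    smart_days: Iterable[date], perf_days: Iterable[date], min_days: int = 3
) -> bool:
    """A day counts as covered if either kind of data was collected on it."""
    covered = list(smart_days) + list(perf_days)
    return longest_daily_run(covered) >= min_days
```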
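Step S404: splice the SMART data and the performance data to form the training data set.
Step S405: set labels for the hard disk data in the training data set;
The label of each piece of data in the training data set is calculated from the hard disk failure time; for normal hard disks, all data is labeled 0. Figure 5 is a flowchart of hard disk data labeling according to this embodiment. As shown in Figure 5, it includes the following steps:
Step S501: record the failure time of each failed hard disk;
For a failed hard disk, record the date on which the hard disk failed.
Step S502: judge whether the hard disk has failed;
For a hard disk that has not failed, every piece of data is labeled 0; otherwise go to step S503.
Step S503: for a hard disk that has failed, determine the difference in days between the collection time of the data and the failure time;
If the difference is less than 2 days, the data is labeled 1. If the difference is greater than 5 days, the data is labeled 0. Otherwise go to step S504.
Step S504: for data with a difference of 2 to 5 days, check the SMART attributes of the data and set the label of the hard disk data according to the SMART attributes;
Specifically, if the value of any one of the five attributes SMART5, SMART187, SMART188, SMART197, and SMART198 is greater than 0, the data is labeled 1; otherwise it is labeled 0.
The selection of SMART attributes and the corresponding thresholds in this embodiment only serve to illustrate the steps of the solution; in actual implementation, the attribute selection and the thresholds can be adjusted to the specific situation.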
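The labeling flow of Figure 5 can be sketched as below, using the example values N = 2 and M = 5 from the text. The function name, the `smart_*` key names, and the treatment of missing attributes as 0 are assumptions for illustration.

```python
from datetime import date
from typing import Mapping, Optional

# Hypothetical column names for the raw values of the five attributes in the text.
KEY_ATTRS = ("smart_5", "smart_187", "smart_188", "smart_197", "smart_198")


def label_record(
    collected: date,
    failed: Optional[date],
    raw: Mapping[str, float],
    n_days: int = 2,
    m_days: int = 5,
) -> int:
    """Return 1 (risk) or 0 (no risk) for one daily record of one disk."""
    if failed is None:
        return 0  # S502: disks that never failed are labeled 0 throughout
    gap = (failed - collected).days
    if gap < n_days:
        return 1  # S503: close to the failure date
    if gap > m_days:
        return 0  # S503: far from the failure date
    # S504: in between, decide from the raw values of the key SMART attributes
    return 1 if any(raw.get(attr, 0) > 0 for attr in KEY_ATTRS) else 0
```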
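Step S406: clean the hard disk data in the training data set;
Missing attribute values are filled with 0, and the data is denoised. Figure 6 is a flowchart of data cleaning according to this embodiment. As shown in Figure 6, it includes the following steps:
Step S601: check the label of each piece of data of the failed disks in the data set; specifically, traverse the daily data of the failed disks in the data set and check the labels.
Step S602: judge whether the label is 0;
Judge whether the label of a failed-disk record is 0; failed-disk samples labeled 0 are noise data.
Step S603: delete the noise data, that is, delete the data labeled 0.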
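The cleaning in step S406 and Figure 6 amounts to two operations: drop label-0 records that belong to failed disks, and fill missing attribute values with 0. A minimal sketch, assuming each record is a dictionary with hypothetical `serial` and `label` keys:

```python
from typing import Dict, Iterable, List, Set


def clean_training_set(
    records: Iterable[Dict], failed_serials: Set[str], attrs: Iterable[str]
) -> List[Dict]:
    """Remove noisy failed-disk records and fill missing attributes with 0."""
    attrs = list(attrs)
    cleaned = []
    for rec in records:
        if rec["serial"] in failed_serials and rec["label"] == 0:
            continue  # S602/S603: a label-0 record of a failed disk is noise
        filled = dict(rec)  # copy so the caller's records are left untouched
        for attr in attrs:
            if filled.get(attr) is None:
                filled[attr] = 0
        cleaned.append(filled)
    return cleaned
```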
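Figure 7 is a flowchart of hard disk failure prediction and risky-disk handling according to this embodiment. As shown in Figure 7, the data processing method includes the following steps:
Step S701: collect hard disk data;
This includes collecting training data and test data. For training data, the hard disk failure time and the hard disk serial number must be recorded for labeling.
Step S702: process the data;
Training data is processed with the method described above. Test data is processed with the same method, but without the data labeling and data denoising steps.
Step S703: train the fault prediction model;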
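The training data set is trained with the LightGBM binary classification algorithm, and the objective function of the algorithm uses the Focal Loss function, which in its usual binary form is: $L_{fl} = -\alpha\,y\,(1-y')^{\gamma}\log y' - (1-\alpha)(1-y)\,y'^{\gamma}\log(1-y')$
where α is the balance factor, γ is the modulation parameter, y′ is the predicted value, and y is the true value of the sample.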
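As a numeric illustration of the loss, the sketch below assumes the usual binary form of Focal Loss, $-\alpha(1-y')^{\gamma}\log y'$ for positive samples and $-(1-\alpha)\,y'^{\gamma}\log(1-y')$ for negative ones. The defaults α = 0.25 and γ = 2 are commonly used values, not values taken from this document, and the function name is hypothetical.

```python
import math


def focal_loss(
    y_true: int,
    y_pred: float,
    alpha: float = 0.25,
    gamma: float = 2.0,
    eps: float = 1e-12,
) -> float:
    """Binary focal loss for one sample; y_pred is the predicted probability of risk."""
    p = min(max(y_pred, eps), 1.0 - eps)  # clamp to keep log() finite
    if y_true == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
```

With γ = 0 the modulating term disappears and the loss reduces to α-weighted cross-entropy; a larger γ shrinks the contribution of samples the model already classifies well, which is what helps with heavily imbalanced classes. Using this as a training objective in a gradient-boosting library would additionally require its first and second derivatives with respect to the raw score, which are not shown here.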
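Step S704: evaluate the trained fault prediction model;
30% of the training set is randomly taken as the validation set and the remaining data is used as the training set. On the validation set, F1-Score is used as the evaluation metric. The relevant terms and metrics are defined as follows, where precision is the precision rate and recall is the recall rate: npp is the number of hard disks predicted, within the evaluation window, to fail within the next 30 days; ntpp is the number of failed hard disks within the evaluation window that were detected 30 days in advance; ntr is the total number of hard disk failures within the evaluation window; ntpr is the number of failed hard disks within the evaluation window that were detected 30 days in advance.
precision = ntpp / npp, recall = ntpr / ntr
F1-Score = 2 × precision × recall / (precision + recall)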
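The metric can be computed from the four counts as sketched below, assuming precision = ntpp / npp and recall = ntpr / ntr, which is how the definitions above are most naturally read. The function name and the choice to return 0.0 when a denominator is zero are assumptions.

```python
from typing import Tuple


def f1_from_counts(npp: int, ntpp: int, ntr: int, ntpr: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from the evaluation-window counts."""
    precision = ntpp / npp if npp else 0.0
    recall = ntpr / ntr if ntr else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, 10 disks predicted to fail of which 8 really fail, against 16 actual failures of which 8 were caught in advance, gives a precision of 0.8 and a recall of 0.5.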


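Step S705: run inference on the hard disk data with the trained fault prediction model;
The trained model is used to run inference on the test data; the inference result should be a value between 0 and 1.
Step S706: perform a second prediction on the hard disk data through the expert system;
For risky hard disks inferred by the AI model, the data is input into the expert system. Figure 8 is a flowchart of judging a risky hard disk through the expert system according to this embodiment. As shown in Figure 8, the expert system consists of two rule sets, a SMART rule set and a performance data rule set, and the flow includes the following steps:
Step S801: check the hard disk SMART data and performance data;
The data of hard disks judged risky by the AI model is input into the expert system, which checks the raw values of the SMART data and the performance data of these hard disks.
Step S802: judge whether the raw value of SMART5 is greater than 500 (an example of the first preset value above); if no, execute step S803; if yes, execute step S808;
Step S803: judge whether the raw value of SMART187 is greater than 100 (an example of the second preset value above); if no, execute step S804; if yes, execute step S808;
Step S804: judge whether the raw value of SMART188 is greater than 100 (an example of the third preset value above); if no, execute step S805; if yes, execute step S808;
Step S805: judge whether the raw value of SMART197 or the raw value of SMART198 is greater than 10 (an example of the fourth preset value above); if no, execute step S806; if yes, execute step S808;
Step S806: judge whether the average number of files successfully read per second by the hard disk is greater than 50 (an example of the fifth preset value above); if no, execute step S807; if yes, execute step S808;
That is, in the performance data, if the average number of files successfully read per second by the hard disk is greater than 50, the hard disk is considered risky; otherwise, the hard disk is considered normal.
Step S807: determine that the hard disk is normal.
Step S808: handle the risky hard disk; specifically, for hard disks judged risky by both the expert system and the AI, recommend that the user replace the hard disk.
The selection of SMART attributes and performance attributes and the corresponding thresholds in this embodiment only serve to illustrate the steps of the solution; in actual implementation, the attribute selection and the thresholds can be adjusted to the specific situation.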
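The rule cascade of Figure 8 can be sketched as an ordered table of thresholds, using the example values from the text (500, 100, 100, 10, 50). The key names such as `smart_5` and `avg_files_read_per_sec` are hypothetical, and returning the reason alongside the verdict is an added convenience meant to show how a rule-based check can explain its result.

```python
from typing import Mapping, Tuple

# (attribute names, threshold on the raw value), checked in order.
SMART_RULES = (
    (("smart_5",), 500),
    (("smart_187",), 100),
    (("smart_188",), 100),
    (("smart_197", "smart_198"), 10),
)
PERF_FIELD = "avg_files_read_per_sec"  # hypothetical performance column name
PERF_THRESHOLD = 50


def expert_check(raw: Mapping[str, float]) -> Tuple[bool, str]:
    """Return (at_risk, reason); the first rule that fires decides the result."""
    for names, threshold in SMART_RULES:
        for name in names:
            value = raw.get(name, 0)
            if value > threshold:
                return True, f"{name} raw value {value} > {threshold}"
    value = raw.get(PERF_FIELD, 0)
    if value > PERF_THRESHOLD:
        return True, f"{PERF_FIELD} {value} > {PERF_THRESHOLD}"
    return False, "no rule fired"
```

Because the rules live in a data table, adjusting attributes or thresholds for a particular deployment does not require changing the control flow.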
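Figure 9 is a flowchart of faulty-disk handling according to this embodiment. As shown in Figure 9, it includes the following steps:
Step S901: obtain the first fault prediction result produced by the fault prediction model's inference on the hard disk data;
After the hard disk data to be processed has been preprocessed, the trained fault prediction model is used to run inference on it.
Step S902: judge whether the first fault prediction result is that the hard disk is normal; if no, execute step S903;
If the inference result is normal, the hard disk is considered healthy; otherwise, go to step S903.
Step S903: obtain the second fault prediction result produced by the expert system's second prediction;
Step S904: determine the hard disk risk level according to the second fault prediction result, and handle the risky hard disk according to the risk level.
If the second fault prediction result from the expert system is that the hard disk is at risk, the risk level is high (corresponding to the first level above) and the hard disk is replaced. If the second fault prediction result from the expert system is that the hard disk is not at risk, the risk level is low (corresponding to the second level above) and the hard disk data is backed up automatically at regular intervals.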
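For hard disks that the AI judges risky but the expert system considers not at risk, the system treats the hard disk as low risk and backs up its data automatically. Figure 10 is a flowchart of determining the automatic backup period of a hard disk according to this embodiment. As shown in Figure 10, it includes the following steps:
Step S1001: check the Redundant Arrays of Independent Disks (RAID) configuration of the risky hard disk;
Step S1002: if the RAID configuration is RAID0, back up the data automatically in the early morning every day; otherwise go to step S1003.
Step S1003: if the RAID configuration is RAID5, back up the data automatically every week; otherwise go to step S1004.
Step S1004: if the RAID configuration is RAID1 or a higher-level RAID, do not back up by default, although the user can manually configure an automatic backup period; otherwise go to step S1005.
Step S1005: if the RAID configuration is another type of RAID, back up the data automatically every week.
The backup periods in this embodiment only serve to illustrate the steps of the solution; in actual implementation, the backup period can be adjusted to the specific situation.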
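The period selection of Figure 10 can be sketched as a small lookup. The text does not say which configurations count as "RAID1 or a higher-level RAID", so the set `MIRRORED_OR_HIGHER` below is purely an assumption for illustration, as are the function and parameter names.

```python
from typing import Optional

# Assumption: which levels are treated like RAID1 is not specified in the source.
MIRRORED_OR_HIGHER = {"RAID1", "RAID6", "RAID10"}


def backup_interval_days(
    raid_level: str, user_interval_days: Optional[int] = None
) -> Optional[int]:
    """Return the backup period in days, or None when no automatic backup is scheduled."""
    level = raid_level.strip().upper().replace(" ", "")
    if level == "RAID0":
        return 1  # S1002: no redundancy, back up daily
    if level == "RAID5":
        return 7  # S1003: back up weekly
    if level in MIRRORED_OR_HIGHER:
        return user_interval_days  # S1004: off by default, user may set a period
    return 7  # S1005: any other configuration, back up weekly
```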
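Figure 11 is a flowchart of hard disk data backup according to this embodiment. As shown in Figure 11, it includes the following steps:
Step S1101: compress the hard disk data that needs to be backed up;
Step S1102: judge whether the local machine has a spare hard disk; if yes, execute step S1103; if no, execute step S1104;
Step S1103: back up the compressed data to another spare hard disk in the data center;
Step S1104: back up the compressed data to the spare hard disk.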
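According to another aspect of the embodiments of the present disclosure, a hard disk failure prediction device is also provided. Figure 12 is a block diagram of the hard disk failure prediction device according to an embodiment of the present disclosure. As shown in Figure 12, the device includes:
a first collection module 122, configured to collect hard disk data to be processed, where the hard disk data includes SMART data and performance data;
a first splicing module 124, configured to splice the SMART data and the performance data to obtain target hard disk data;
an input module 126, configured to input the target hard disk data into a pre-trained fault prediction model to obtain a first fault prediction result output by the fault prediction model;
a secondary prediction module 128, configured to determine a second time, when the first fault prediction result indicates a risk of failure and based on the SMART data and the performance data of the hard disk data to be processed, whether a risk of failure exists, to obtain a second fault prediction result;
a first determination module 1210, configured to determine a hard disk failure risk level according to the second fault prediction result.
In one embodiment, the device further includes:
a first feature extraction module, configured to perform feature extraction on the hard disk data to be processed to obtain feature data to be processed;
a first filtering module, configured to filter the feature data to be processed to obtain target feature data to be processed;
a second determination module, configured to determine that the target feature data to be processed meets preset requirements.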
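In one embodiment, the first determination module is further configured to determine that the hard disk failure risk level is the first level if the second fault prediction result is that a risk of failure exists, and to determine that the hard disk failure risk level is the second level if the second fault prediction result is that no risk of failure exists, where the risk level of the first level is higher than that of the second level.
In one embodiment, the device further includes:
a prompt module, configured to prompt the replacement of the hard disk corresponding to the hard disk data to be processed when the hard disk failure risk level is the first level;
a backup module, configured to back up the hard disk data to be processed when the hard disk failure risk level is the second level.
In one embodiment, the backup module is further configured to obtain the RAID configuration of the hard disk corresponding to the hard disk data to be processed; if the RAID configuration is RAID0, back up the hard disk data every day; if the RAID configuration is RAID5, back up the hard disk data every week; if the RAID configuration is RAID1 or a RAID with a level higher than RAID1, back up the hard disk data according to a preset time period; if the RAID configuration is another type of RAID, back up the hard disk data every week, where the other types of RAID are RAIDs other than RAID0, RAID5, RAID1, and RAIDs with a level higher than RAID1.
In one embodiment, the secondary prediction module 128 is further configured to judge, from the SMART data of the hard disk data to be processed, whether the second fault prediction result is that a hard disk risk exists; if yes, determine that the second fault prediction result is that a hard disk risk exists; if no, judge from the performance data whether the second fault prediction result is that a hard disk risk exists; if yes, determine that the second fault prediction result is that a hard disk risk exists; if no, determine that the second fault prediction result is that no hard disk risk exists.
In one embodiment, the secondary prediction module 128 is further configured so that, where the SMART data includes SMART5, SMART187, SMART188, SMART197, and SMART198, it judges whether the raw value of SMART5 is greater than the first preset value and, if yes, determines that the second fault prediction result is that a hard disk risk exists; if no, it judges whether the raw value of SMART187 is greater than the second preset value and, if yes, determines that a hard disk risk exists; if no, it judges whether the raw value of SMART188 is greater than the third preset value and, if yes, determines that a hard disk risk exists; if no, it judges whether the raw value of SMART197 or SMART198 is greater than the fourth preset value and, if yes, determines that a hard disk risk exists; if no, it judges whether the average number of files successfully read per second by the hard disk in the performance data is greater than the fifth preset value; if yes, it determines that the second fault prediction result is that a hard disk risk exists; if no, it determines that the second fault prediction result is that no hard disk risk exists.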
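In one embodiment, the device further includes:
a second collection module, configured to collect a preset amount of hard disk data, where the hard disk data includes SMART data and performance data;
a second splicing module, configured to splice the preset amount of SMART data and performance data to form a training data set;
a training module, configured to train the fault prediction model according to the training data set to obtain the trained fault prediction model.
In one embodiment, the device further includes:
a cleaning module, configured to clean invalid data and noise data in the training data set and fill in missing data.
In one embodiment, the device further includes:
a second feature extraction module, configured to perform feature extraction on the preset amount of hard disk data to obtain the corresponding feature data;
a second filtering module, configured to filter the feature data corresponding to the preset amount of hard disk data to obtain the corresponding target feature data;
a third determination module, configured to determine that the target feature data corresponding to the preset amount of hard disk data meets preset requirements.
In one embodiment, the training module includes:
a setting sub-module, configured to set the label of each piece of data in the training data set according to the hard disk failure time;
a training sub-module, configured to train the fault prediction model according to the training data set and obtain the trained fault prediction model when the loss function meets a preset condition.
In one embodiment, the loss function L_fl is the Focal Loss, which in its usual binary form is $L_{fl} = -\alpha\,y\,(1-y')^{\gamma}\log y' - (1-\alpha)(1-y)\,y'^{\gamma}\log(1-y')$, where α is the balance factor, γ is the modulation parameter, y′ is the predicted value, and y is the true value of the sample.
In one embodiment, the setting sub-module is further configured to compare the hard disk failure time with the data collection time; when the interval between the data collection time and the hard disk failure time is less than N days, set the label to 1; when the interval is greater than N days and less than M days and the raw value of one of the multiple attribute fields of the SMART data is greater than 0, set the label to 1; when the interval is greater than N days and less than M days and the raw values of the multiple attribute fields of the SMART data are all equal to 0, set the label to 0; when the interval is greater than M days, set the label to 0, where 1 represents risk and 0 represents no risk.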
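Figure 13 is a block diagram of the hard disk failure prediction device according to an optional embodiment of the present disclosure. As shown in Figure 13, it includes:
a data collection module 132, configured to implement the functions of the first collection module 122 and the second collection module. It is mainly responsible for collecting hard disk data at fixed time intervals, mainly of two types: SMART data and hard disk performance data collected inside the operating system.
The hard disk SMART data includes the raw value and the current value; the main SMART attribute fields are SMART5, SMART187, SMART188, SMART197, SMART198, and so on;
The hard disk performance data collected inside the operating system includes disk-level performance indicators, such as throughput and average wait time for I/O operations, and server-level performance indicators, such as CPU activity and page-in and page-out activity;
a feature extraction module 134, configured to implement the function of the first feature extraction module. It is mainly responsible for feature extraction on the collected data, filtering out data columns not used by the detection and prediction algorithm and keeping only the columns that are used later.
a data verification module 136, mainly responsible for verifying whether the time span of the data and the number of sampling points meet the minimum data volume required for fault detection and prediction: data is collected at least once a day, over at least two days.
a data combination module 138, mainly responsible for splicing the SMART data and the performance data to form the training data set.
a label calculation module 1310, mainly responsible for calculating the label of each piece of data in the training data set from the hard disk failure time. The specific method is to compare the hard disk failure time with the data collection time. If the interval between the collection time and the failure time is less than N days, the data is labeled 1, which represents risk. If the interval is greater than N days and less than M days and, in the collected data, the raw value of at least one of the five preferred attributes SMART5, SMART187, SMART188, SMART197, and SMART198 is greater than 0, the data is also labeled 1. If the interval is greater than N days and less than M days and the raw values of these five attributes are all equal to 0, the data is labeled 0, which represents healthy. If the interval is greater than M days, the data is also labeled 0.
a data cleaning module 1312, configured to implement the functions of the above cleaning module. It is mainly responsible for cleaning invalid data and noise data in the data set and filling in missing data. For failed hard disks in the training set, data records labeled 0 are deleted to reduce noise in the data set.
an AI training module 1314, configured to implement the functions of the above training module. It is mainly responsible for machine learning training on the training data set.
a training loss calculation module 1316, configured to implement part of the functions of the above training module. It is mainly responsible for calculating the loss of the samples so that the model is trained toward lower loss. Specifically, the Focal Loss function is used as the loss function.
an inference module 1318, configured to input unknown hard disk data into the trained model for inference.
an expert system module 1320, which, for hard disks predicted to be risky by the AI model, inputs the data into the expert system and judges the state of the hard disk again.
a risk disk processing module 1322, which sets the risk level to high for hard disks that both the AI model and the expert system judge to be abnormal and recommends that the user replace the hard disk, and sets the risk level to medium for hard disks in which the expert system finds no abnormality.
a hard disk data backup module 1324, configured to implement the functions of the above backup module. For hard disks with a medium risk level, the hard disk data is compressed and automatically backed up to another hard disk at regular intervals. The other hard disk can be a hard disk of the same server or a dedicated backup disk in the data center. Each backup overwrites the previous one, and the automatic backup is preferably scheduled in the early morning, when the business volume is low.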
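Embodiments of the present disclosure also provide a computer-readable storage medium in which a computer program is stored, where the computer program is configured to execute, when run, the steps in any of the above method embodiments.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program.
Embodiments of the present disclosure also provide an electronic device, including a memory and a processor. A computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.
In an exemplary embodiment, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementations, which are not repeated here.
Obviously, those skilled in the art should understand that the above modules or steps of the present disclosure can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed across a network composed of multiple computing devices. They can be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases the steps shown or described can be executed in an order different from that given here. Alternatively, they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present disclosure and are not intended to limit it. For those skilled in the art, the present disclosure may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the principles of the present disclosure shall fall within its scope of protection.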

Claims (16)

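  1. A hard disk failure prediction method, the method comprising:
    collecting hard disk data to be processed, wherein the hard disk data includes SMART data and performance data;
    splicing the SMART data and the performance data to obtain target hard disk data;
    inputting the target hard disk data into a pre-trained fault prediction model to obtain a first fault prediction result output by the fault prediction model;
    when the first fault prediction result indicates a risk of failure, determining a second time, based on the SMART data and the performance data of the hard disk data to be processed, whether a risk of failure exists, to obtain a second fault prediction result;
    determining a hard disk failure risk level according to the second fault prediction result.
  2. The method according to claim 1, wherein before splicing the SMART data and the performance data to obtain the target hard disk data, the method further comprises:
    performing feature extraction on the hard disk data to be processed to obtain feature data to be processed;
    filtering the feature data to be processed to obtain target feature data to be processed;
    determining that the target feature data to be processed meets preset requirements.
  3. The method according to claim 1, wherein determining the hard disk failure risk level according to the second fault prediction result comprises:
    if the second fault prediction result is that a risk of failure exists, determining that the hard disk failure risk level is a first level;
    if the second fault prediction result is that no risk of failure exists, determining that the hard disk failure risk level is a second level, wherein the risk level of the first level is higher than that of the second level.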
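  4. The method according to claim 3, wherein after determining a second time, based on the SMART data and the performance data of the hard disk data to be processed, whether a risk of failure exists, to obtain the second fault prediction result, the method further comprises:
    when the hard disk failure risk level is the first level, prompting the replacement of the hard disk corresponding to the hard disk data to be processed;
    when the hard disk failure risk level is the second level, backing up the hard disk data to be processed.
  5. The method according to claim 4, wherein backing up the hard disk data to be processed comprises:
    obtaining the RAID configuration of the hard disk corresponding to the hard disk data to be processed;
    if the RAID configuration is RAID0, backing up the hard disk data every day;
    if the RAID configuration is RAID5, backing up the hard disk data every week;
    if the RAID configuration is RAID1 or a RAID with a level higher than RAID1, backing up the hard disk data according to a preset time period;
    if the RAID configuration is another type of RAID, backing up the hard disk data every week, wherein the other types of RAID are RAIDs other than RAID0, RAID5, RAID1, and RAIDs with a level higher than RAID1.
  6. The method according to claim 1, wherein determining a second time, based on the SMART data and the performance data of the hard disk data to be processed, whether a risk of failure exists, to obtain the second fault prediction result comprises:
    judging, from the SMART data of the hard disk data to be processed, whether the second fault prediction result is that a hard disk risk exists;
    if yes, determining that the second fault prediction result is that a hard disk risk exists;
    if no, judging from the performance data whether the second fault prediction result is that a hard disk risk exists; if yes, determining that the second fault prediction result is that a hard disk risk exists; if no, determining that the second fault prediction result is that no hard disk risk exists.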
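  7. The method according to claim 6, wherein
    judging, from the SMART data of the hard disk data to be processed, whether the second fault prediction result is that a hard disk risk exists comprises:
    the SMART data including at least SMART5, SMART187, SMART188, SMART197, and SMART198, judging whether the raw value of SMART5 is greater than a first preset value, and if yes, determining that the second fault prediction result is that a hard disk risk exists;
    if no, judging whether the raw value of SMART187 is greater than a second preset value, and if yes, determining that the second fault prediction result is that a hard disk risk exists;
    if no, judging whether the raw value of SMART188 is greater than a third preset value, and if yes, determining that the second fault prediction result is that a hard disk risk exists;
    if no, judging whether the raw value of SMART197 or SMART198 is greater than a fourth preset value, and if yes, determining that the second fault prediction result is that a hard disk risk exists;
    if no, judging from the performance data whether the second fault prediction result is that a hard disk risk exists comprises: judging whether the average number of files successfully read per second by the hard disk in the performance data is greater than a fifth preset value; if yes, determining that the second fault prediction result is that a hard disk risk exists; if no, determining that the second fault prediction result is that no hard disk risk exists.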
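  8. The method according to any one of claims 1 to 7, wherein before collecting the hard disk data to be processed, the method further comprises:
    collecting a preset amount of hard disk data, wherein the hard disk data includes SMART data and performance data;
    splicing the preset amount of SMART data and performance data to form a training data set;
    training a fault prediction model according to the training data set to obtain the trained fault prediction model.
  9. The method according to claim 8, wherein before training the fault prediction model according to the training data set to obtain the trained fault prediction model, the method further comprises:
    cleaning invalid data and noise data in the training data set and filling in missing data.
  10. The method according to claim 8, wherein before splicing the preset amount of SMART data and performance data to form the training data set, the method further comprises:
    performing feature extraction on the preset amount of hard disk data to obtain the corresponding feature data;
    filtering the feature data corresponding to the preset amount of hard disk data to obtain the corresponding target feature data;
    determining that the target feature data corresponding to the preset amount of hard disk data meets preset requirements.
  11. The method according to claim 8, wherein training the fault prediction model according to the training data set to obtain the trained fault prediction model comprises:
    setting the label of each piece of data in the training data set according to the hard disk failure time;
    training the fault prediction model according to the training data set, and obtaining the trained fault prediction model when the loss function meets a preset condition.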
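  12. The method according to claim 11, wherein the loss function L_fl is the Focal Loss, which in its usual binary form is: $L_{fl} = -\alpha\,y\,(1-y')^{\gamma}\log y' - (1-\alpha)(1-y)\,y'^{\gamma}\log(1-y')$
    where α is the balance factor, γ is the modulation parameter, y′ is the predicted value, and y is the true value of the sample.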
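  13. The method according to claim 11, wherein setting the label of each piece of data in the training data set according to the hard disk failure time comprises:
    comparing the hard disk failure time with the data collection time;
    when the interval between the data collection time and the hard disk failure time is less than N days, setting the label to 1;
    when the interval between the data collection time and the hard disk failure time is greater than N days and less than M days and the raw value of one of the multiple attribute fields of the SMART data is greater than 0, setting the label to 1;
    when the interval between the data collection time and the hard disk failure time is greater than N days and less than M days and the raw values of the multiple attribute fields of the SMART data are all equal to 0, setting the label to 0;
    when the interval between the data collection time and the hard disk failure time is greater than M days, setting the label to 0, where 1 represents risk and 0 represents no risk.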
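  14. A hard disk failure prediction device, the device comprising:
    a first collection module, configured to collect hard disk data to be processed, wherein the hard disk data includes SMART data and performance data;
    a first splicing module, configured to splice the SMART data and the performance data to obtain target hard disk data;
    an input module, configured to input the target hard disk data into a pre-trained fault prediction model to obtain a first fault prediction result output by the fault prediction model;
    a secondary prediction module, configured to determine a second time, when the first fault prediction result indicates a risk of failure and based on the SMART data and the performance data of the hard disk data to be processed, whether a risk of failure exists, to obtain a second fault prediction result;
    a first determination module, configured to determine a hard disk failure risk level according to the second fault prediction result.
  15. A computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 13.
  16. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 13.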
PCT/CN2023/095118 2022-06-09 2023-05-18 一种硬盘故障预测方法、装置、存储介质及电子装置 WO2023236753A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210649272.9 2022-06-09
CN202210649272.9A CN117271229A (zh) 2022-06-09 2022-06-09 一种硬盘故障预测方法、装置、存储介质及电子装置

Publications (1)

Publication Number Publication Date
WO2023236753A1 true WO2023236753A1 (zh) 2023-12-14

Family

ID=89117585

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095118 WO2023236753A1 (zh) 2022-06-09 2023-05-18 一种硬盘故障预测方法、装置、存储介质及电子装置

Country Status (2)

Country Link
CN (1) CN117271229A (zh)
WO (1) WO2023236753A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108845760A (zh) * 2018-05-28 2018-11-20 郑州云海信息技术有限公司 一种硬盘维护方法、装置、设备及可读存储介质
CN109491850A (zh) * 2018-11-21 2019-03-19 北京北信源软件股份有限公司 一种磁盘故障预测方法及装置
CN110164501A (zh) * 2018-06-29 2019-08-23 腾讯科技(深圳)有限公司 一种硬盘检测方法、装置、存储介质及设备
US20200233587A1 (en) * 2019-01-18 2020-07-23 EMC IP Holding Company LLC Method, device and computer product for predicting disk failure
CN111737067A (zh) * 2020-05-29 2020-10-02 苏州浪潮智能科技有限公司 一种硬盘故障预测模型解释方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108845760A (zh) * 2018-05-28 2018-11-20 郑州云海信息技术有限公司 一种硬盘维护方法、装置、设备及可读存储介质
CN110164501A (zh) * 2018-06-29 2019-08-23 腾讯科技(深圳)有限公司 一种硬盘检测方法、装置、存储介质及设备
CN109491850A (zh) * 2018-11-21 2019-03-19 北京北信源软件股份有限公司 一种磁盘故障预测方法及装置
US20200233587A1 (en) * 2019-01-18 2020-07-23 EMC IP Holding Company LLC Method, device and computer product for predicting disk failure
CN111737067A (zh) * 2020-05-29 2020-10-02 苏州浪潮智能科技有限公司 一种硬盘故障预测模型解释方法及装置

Also Published As

Publication number Publication date
CN117271229A (zh) 2023-12-22

Similar Documents

Publication Publication Date Title
US10223190B2 (en) Identification of storage system elements causing performance degradation
US20200133758A1 (en) Method, device, and computer program product for facilitating prediction of disk failure
CN112214369A (zh) 基于模型融合的硬盘故障预测模型建立方法及其应用
KR101948634B1 (ko) 스마트 컴퓨팅을 위한 시스템 자원의 장애 예측 방법
CN108491861A (zh) 基于多源多参量融合的输变电设备状态异常模式识别方法及装置
CN106897178A (zh) 一种基于极限学习机的慢盘检测方法及系统
US11429497B2 (en) Predicting and handling of slow disk
CN114943321A (zh) 一种针对硬盘的故障预测方法、装置及设备
CN110955550A (zh) 一种云平台故障定位方法、装置、设备及存储介质
CN109684141A (zh) 一种磁盘故障诊断方法、装置、终端及可读存储介质
US11734103B2 (en) Behavior-driven die management on solid-state drives
CN112819640B (zh) 一种面向微服务的金融回测容错系统及方法
CN110597655A (zh) 一种迁移与基于纠删码的重构相耦合的快速预知修复方法和实现
CN109196458A (zh) 存储系统可用容量计算方法及装置
CN110321067B (zh) 估计和管理存储设备退化的系统和方法
WO2023236753A1 (zh) 一种硬盘故障预测方法、装置、存储介质及电子装置
CN111858108B (zh) 一种硬盘故障预测方法、装置、电子设备和存储介质
US7356443B2 (en) Systems and methods for analyzing the selection of measurements of a communication network
WO2023061209A1 (zh) 内存故障的预测方法、电子设备和计算机可读存储介质
CN116775362A (zh) 独立冗余磁盘阵列的通路阻塞处理方法、系统
CN115509853A (zh) 一种集群数据异常检测方法及电子设备
CN115480948A (zh) 硬盘故障预测方法及相关设备
CN114003172A (zh) 存储容量校正方法、装置、计算机设备以及存储介质
CN117591337B (zh) 计算机信息数据交互传输管理系统及方法
CN103390429A (zh) 一种硬盘的在线检测方法及服务器

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23818918

Country of ref document: EP

Kind code of ref document: A1