WO2022116922A1 - Magnetic disk failure prediction method, prediction model training method, and electronic device - Google Patents

Publication number
WO2022116922A1
Authority
WO
WIPO (PCT)
Prior art keywords
prediction
training sample
disk
information
predicted
Prior art date
Application number
PCT/CN2021/133728
Other languages
French (fr)
Chinese (zh)
Inventor
宋顺
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2022116922A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/008Reliability or availability analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447Performance evaluation by modeling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452Performance evaluation by statistical analysis

Definitions

  • The present application relates to, but is not limited to, the field of data storage, and in particular to a method for predicting disk failure, a method for training a prediction model, and an electronic device.
  • Disks are important hardware devices for data storage, and larger data centers usually contain more disks. Disks typically have a limited lifespan, and toward the end of their useful life the chance of disk damage increases dramatically.
  • Replication or erasure coding is usually used for data redundancy, but it can only avoid data loss caused by the failure of a single disk. When multiple disks fail at the same time, there is still a risk of data loss.
  • SMART (Self-Monitoring, Analysis and Reporting Technology)
  • Embodiments of the present application provide a disk failure prediction method, a prediction model training method, and an electronic device.
  • In a first aspect, an embodiment of the present application provides a method for predicting disk failure, including: acquiring a prediction data set of a disk to be predicted, where the prediction data set includes IO information of prediction sample input/output (IO) operations and the SMART information corresponding to the prediction sample IOs, and where the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted; and inputting the prediction data set into a pre-trained prediction model to obtain a prediction result for the disk to be predicted.
  • In a second aspect, an embodiment of the present application further provides a method for training a prediction model, including: acquiring a prediction training sample set of a training sample disk, where the prediction training sample set includes training sample IO information of training sample IOs and training sample SMART information corresponding to the training sample IOs, and where the prediction training sample set is collected in a cache disk acceleration scenario of the training sample disk; and training the prediction model according to the prediction training sample set.
  • In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, performs the disk failure prediction method described in the first aspect or the prediction model training method described in the second aspect.
  • In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used to execute the disk failure prediction method described in the first aspect or the prediction model training method described in the second aspect.
  • FIG. 1 is a flowchart of a method for predicting disk failure provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a module framework provided by another embodiment of the present application;
  • FIG. 3 is a flowchart of determining a prediction result according to a prediction period provided by another embodiment of the present application;
  • FIG. 4 is a flowchart of determining a prediction result according to a periodic failure probability provided by another embodiment of the present application;
  • FIG. 5 is a flowchart of determining a prediction result according to the number of times a period is determined to be a high-risk period provided by another embodiment of the present application;
  • FIG. 6 is a flowchart of determining that a disk to be predicted is in a cache disk acceleration scenario provided by another embodiment of the present application;
  • FIG. 7 is a flowchart of a prediction model training method provided by another embodiment of the present application;
  • FIG. 8 is a flowchart of determining that a training sample disk is in a cache disk acceleration scenario provided by another embodiment of the present application;
  • FIG. 9 is a flowchart of obtaining a prediction training sample set provided by another embodiment of the present application;
  • FIG. 10 is a flowchart of determining training sample IO information according to preset conditions provided by another embodiment of the present application;
  • FIG. 11 is a flowchart of training a prediction model according to a training period provided by another embodiment of the present application;
  • FIG. 12 is a flowchart of dividing the prediction training sample set into a training sample set and a test sample set provided by another embodiment of the present application;
  • FIG. 13 is a schematic structural diagram of an electronic device provided by another embodiment of the present application.
  • the present application provides a method for predicting disk failure, a method for training a prediction model, and an electronic device.
  • The method for predicting disk failure includes: acquiring a prediction data set of a disk to be predicted, where the prediction data set includes IO information of prediction sample IOs and the SMART information corresponding to the prediction sample IOs, and where the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted; and inputting the prediction data set into a pre-trained prediction model to obtain a prediction result for the disk to be predicted.
  • In this way, disk failure prediction can be performed for all types of disks by combining IO information and SMART information, which effectively reduces the risk of data loss.
  • FIG. 1 is a flowchart of a method for predicting a disk failure provided by an embodiment of the present application.
  • the method for predicting a disk failure includes, but is not limited to, steps S110 and S120.
  • Step S110: obtain the prediction data set of the disk to be predicted, where the prediction data set includes the IO information of the prediction sample IOs and the SMART information corresponding to the prediction sample IOs, and where the prediction data set is collected in the cache disk acceleration scenario of the disk to be predicted.
  • the IO information of each predicted sample IO of the disk to be predicted includes multiple attributes such as IO delay, IO size, and IO status information. Therefore, using IO information as the input of the prediction model can effectively alleviate the problem of insufficient attributes.
  • The minimum allowable time for a prediction sample IO can be determined from the disk's number of read and write operations per second (Input/Output Operations Per Second, IOPS), and this minimum allowable time is used as the duration threshold.
  • When the sum of the durations is greater than the duration threshold, it can be determined that the prediction sample IOs are large-block IOs, that is, that the disk working scenario corresponding to the IOs is the cache disk acceleration scenario. It is worth noting that in the cache disk acceleration scenario, the IO issued to the disk tends to consist of large-block reads and writes, with few small-block IOs.
  • The block layer usually has a maximum IO size of 512K, so the range of IO sizes is relatively small. This provides a basis for tracking the IO size of the disk.
  • IO Quality of Service
  • Front-end applications perform service capability matching, and the storage side sets front-end Quality of Service (QoS), back-end QoS, and so on.
  • QoS effectively prevents IO bursts in most cases, and avoids situations where the IO queue depth is too large and the disk is too heavily loaded to provide stable service.
  • Therefore, the IO in the cache disk acceleration scenario can be used for failure prediction.
  • the prediction data set in this embodiment includes both IO information and corresponding SMART information.
  • The corresponding SMART information can be the SMART information recorded while the prediction sample IO is being executed, or it can be based on previously collected SMART information.
  • the SMART information is collected periodically, for example, once a day, and the specific collection method and period can be adjusted according to the actual situation, so that the SMART information and the IO information have a certain correlation.
  • The disk type information can be used as one of the selected features, so that the prediction result obtained by the prediction model can represent the failure risk of this type of disk. The disk type information can also be obtained as an input of the prediction model, so that the prediction model can predict the failure of different types of disks.
  • the disk type information may include a disk manufacturer, a disk model, a disk capacity, a disk serial number, and a rotation speed, which is not limited in this embodiment.
  • Step S120: input the prediction data set into the pre-trained prediction model to obtain the prediction result of the disk to be predicted.
  • The prediction timing can be set according to operational requirements, for example once a day, to reduce the risk of data loss, or a prediction can be made each time IO information and SMART information are collected; this can be adjusted according to actual needs.
  • IO information is an aggregate, end-result signal. It needs to be combined with information about the disk itself to make a comprehensive judgment and cannot be used for prediction alone; otherwise, false alarms are likely.
  • The SMART information represents the state of the disk parameters, so combining the characteristics of the IO information with the characteristics of the SMART information makes it possible to predict the failure risk of the disk to be predicted more accurately.
  • Before the prediction model is used to obtain prediction results, it can be trained in advance according to IO information and SMART information.
  • For example, the delay information in the IO information, the rate of change of each parameter in the SMART information, and the absolute value of the increment of each parameter in the SMART information can be used as training features. The features are labeled and then input to the prediction model for training, so that the prediction model can obtain the prediction result of the disk to be predicted according to these features.
  • The prediction result can take any form, such as the current failure risk probability of the disk to be predicted or a specific risk value. The prediction result can also be tied to a specific collection period: for example, if the prediction data set is the data collected within one week, the prediction result is the risk probability of failure in the next week. Any form that reflects the failure risk of the disk to be predicted can be used, which is not limited here.
  • A system architecture for applying the disk failure prediction method of the present application may be as shown in FIG. 2. It includes a prediction center and several proxy nodes, and both the prediction center and the proxy nodes may take the form of electronic devices or servers, which is not limited here.
  • The prediction center may include an alarm management module, a prediction module, and a disk information management module, where the alarm management module is configured to issue an alarm prompt when it is detected that a disk is at high risk, and the prediction module is configured to perform the prediction for the disk to be predicted according to the prediction data set.
  • The disk information management module is configured to receive and manage the IO information and SMART information sent by the proxy nodes, and to form the prediction data set.
  • Each proxy node includes an IO module and a SMART module, where the IO module is configured to obtain the IO information of the disks of the proxy node and to filter the IO information according to preset rules, so that the filtered IO information can be used to form the prediction data set.
  • The SMART module is configured to collect the SMART information of the disks of the proxy node. It should be noted that this application does not involve structural improvements to the proxy node or the prediction center, but only the processing of the collected data, so their structure is not described further here.
  • The IO information further includes IO time information.
  • Step S120 in the embodiment shown in FIG. 1 also includes, but is not limited to, the following steps:
  • Step S310: determine the prediction period, and determine a period data set from the prediction data set according to the prediction period and the IO time information;
  • Step S320: obtain the periodic failure probability of the disk to be predicted in the prediction period according to the period data set and the prediction model;
  • Step S330: determine the prediction result of the disk to be predicted according to the periodic failure probability.
  • The prediction period can be selected according to actual needs. For example, to predict the near-term failure risk of the disk to be predicted, the prediction period can be set to several days, one week, or two weeks; to predict longer-term failure risk, it can be set to one month. There can be any number of prediction periods: for example, data for one week, two weeks, and four weeks can be acquired at the same time and a prediction result obtained for each prediction period, so that the failure risk prediction for the disk is more accurate.
  • The collected prediction data set is stored in the disk information management module.
  • Life cycle management can also be performed on the collected data. For example, if the set prediction period is one week, life cycle management can ensure that the IO information and SMART information in the prediction data set held by the disk information management module are both data collected within one week. The specific life cycle management method is not an improvement made in this embodiment and is not described further here.
  • The IO time information can be specific time information about when the disk to be predicted performs the IO, such as the time when the IO starts to be executed or the time when the IO execution is completed. The specific selection criterion can be adjusted according to the actual situation and is not limited here.
  • The periodic failure probability is the failure probability of the disk to be predicted over a horizon whose length is the prediction period. For example, if the prediction period is one week, the period data set is the IO information and SMART information collected in the past week; failure prediction is performed on this information, and the resulting periodic failure probability is the failure probability of the disk to be predicted in the next week.
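The period data set selection in step S310 can be illustrated with a minimal sketch. The `IORecord` layout, the function name `select_period_data`, and the choice of a half-open window ending at the current time are all assumptions made for illustration; the application only states that the period data set is determined from the prediction period and the IO time information.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IORecord:
    """One collected IO sample (hypothetical layout, not taken from the application)."""
    io_time: datetime   # IO time information, e.g. the completion time
    delay_ms: float     # IO delay
    size_bytes: int     # IO size


def select_period_data(records, now, period):
    """Return the records whose IO time lies in the window (now - period, now]."""
    start = now - period
    return [r for r in records if start < r.io_time <= now]


now = datetime(2021, 11, 29, 12, 0)
records = [
    IORecord(now - timedelta(days=1), 12.0, 256 * 1024),
    IORecord(now - timedelta(days=6), 40.0, 512 * 1024),
    IORecord(now - timedelta(days=10), 15.0, 128 * 1024),
]

# Several prediction periods can be evaluated side by side.
one_week = select_period_data(records, now, timedelta(weeks=1))
two_weeks = select_period_data(records, now, timedelta(weeks=2))
```

Each resulting subset would then be passed to the prediction model separately, giving one periodic failure probability per prediction period.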
  • Step S330 in the embodiment shown in FIG. 3 further includes, but is not limited to, the following steps:
  • Step S410: when the periodic failure probability is greater than a preset probability threshold corresponding to the prediction period, determine the prediction result to be high risk;
  • Step S420: when the periodic failure probability is less than or equal to the probability threshold, determine the prediction result to be low risk.
  • The probability threshold can be determined according to actual risk management requirements. For example, with a probability threshold of 80%, a periodic failure probability greater than 80% indicates a high failure risk, and a periodic failure probability less than or equal to 80% indicates a low failure risk. This embodiment does not limit the probability threshold. Several probability thresholds corresponding to several risk levels can also be set according to actual needs, which is not described further here.
  • Using the risk level as the prediction result reflects the failure probability of the disk, so that disk replacement can be scheduled in advance when the prediction result is high risk. In particular, this reduces the chance of multiple disks failing at the same time and thus the risk of data loss.
  • By setting a probability threshold corresponding to low risk, no alarm is generated when the periodic failure probability is below the probability threshold, which effectively reduces the false alarm rate.
  • The periodic failure probability is the prediction result for its own prediction period and does not affect the prediction results of other prediction periods.
  • For example, the periodic failure probability for a prediction period of one week and the periodic failure probability for a prediction period of two weeks are independent parameters: when determining the prediction result for the two-week prediction period, the periodic failure probability for the one-week prediction period is not considered.
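Steps S410 and S420 amount to a per-period threshold comparison, sketched below. The period keys and the dictionary `PROBABILITY_THRESHOLDS` are hypothetical; the 80% value is only the example given in the text, not a recommended setting.

```python
# Hypothetical per-period probability thresholds. The text gives 80% only as
# an example, and each prediction period may have its own threshold.
PROBABILITY_THRESHOLDS = {"1w": 0.80, "2w": 0.80}


def classify_risk(period, periodic_failure_probability,
                  thresholds=PROBABILITY_THRESHOLDS):
    """Return "high" if the probability exceeds the period's threshold, else "low"."""
    threshold = thresholds[period]
    return "high" if periodic_failure_probability > threshold else "low"


# Each prediction period is judged independently of the others.
result_one_week = classify_risk("1w", 0.85)
result_two_weeks = classify_risk("2w", 0.80)
```

Note that a probability exactly equal to the threshold is classified as low risk, matching the "less than or equal to" wording of step S420.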
  • Step S410 in the embodiment shown in FIG. 4 further includes, but is not limited to, the following steps:
  • Step S510: when the periodic failure probability is greater than the probability threshold, determine that the prediction period corresponding to the periodic failure probability is a high-risk period;
  • Step S520: when the number of times that the prediction period has been determined to be a high-risk period is greater than a preset alarm number threshold, determine the prediction result to be high risk.
  • Using the alarm number threshold can effectively reduce the number of false alarms.
  • The specific alarm number threshold can be adjusted according to actual needs, which is not limited here.
  • Alarm information may also be generated according to basic information of the disk to be predicted. The alarm information can be generated by the alarm management module shown in FIG. 2: for example, the alarm information is pushed to the background management system and carries the basic information of the disk, so that maintenance personnel can service the disk promptly and accurately.
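The counting logic of steps S510 and S520 can be sketched as follows. The class name `HighRiskCounter` and the per-disk dictionary are illustrative assumptions. The text does not say whether the count is ever reset, so this sketch simply never resets it; a real deployment would need to decide that.

```python
from collections import defaultdict


class HighRiskCounter:
    """Counts high-risk periods per disk (hypothetical helper, not from the application)."""

    def __init__(self, probability_threshold, alarm_count_threshold):
        self.probability_threshold = probability_threshold
        self.alarm_count_threshold = alarm_count_threshold
        self.counts = defaultdict(int)  # disk id -> number of high-risk periods

    def update(self, disk_id, periodic_failure_probability):
        """Record one prediction and return the resulting prediction result."""
        # Step S510: a period whose probability exceeds the threshold is high-risk.
        if periodic_failure_probability > self.probability_threshold:
            self.counts[disk_id] += 1
        # Step S520: report high risk only once the count exceeds the alarm threshold.
        # Assumption: the count is cumulative and is never reset.
        if self.counts[disk_id] > self.alarm_count_threshold:
            return "high"
        return "low"


counter = HighRiskCounter(probability_threshold=0.8, alarm_count_threshold=2)
results = [counter.update("disk-0", p) for p in (0.9, 0.95, 0.5, 0.85)]
```

With an alarm number threshold of 2, the first two high-risk periods do not raise the result to high risk; only the third one does, which is how the threshold suppresses isolated false alarms.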
  • The IO information also includes the IO duration and the IO size, and the cache disk acceleration scenario of the disk to be predicted is determined by the following steps:
  • Step S610: obtain the IOPS of the disk to be predicted, and determine the duration threshold according to the IOPS and the IO size of the disk to be predicted;
  • Step S620: when the IO duration is greater than the duration threshold, determine that the disk to be predicted is in a cache disk acceleration scenario.
  • The IOPS of the disk to be predicted can be obtained in any way, such as by reading the IOPS performance parameters of the disk, or by running several IO tests on the disk and measuring it. The specific method can be selected according to actual needs.
  • The duration threshold can be obtained by dividing the IO size by the IOPS. Since IOPS represents the read and write operation capability of the disk, the duration threshold represents the minimum allowable time required for the disk to process an IO of a specific size. When the IO duration is greater than this minimum allowable time, it can be determined that the IO was executed in the cache disk acceleration scenario.
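Steps S610 and S620 can be written down literally as a short sketch. The function names are hypothetical. The text does not state the units of IO size or duration, so the example assumes the IO size is expressed as a number of reference-size operations, which makes size divided by IOPS come out in seconds; that unit convention is an assumption, not something the application specifies.

```python
def duration_threshold(io_size, iops):
    """Duration threshold as described in the text: IO size divided by IOPS.

    Assumption: io_size is expressed in units such that io_size / iops is a
    time in seconds (for example, a count of reference-size operations).
    """
    if iops <= 0:
        raise ValueError("iops must be positive")
    return io_size / iops


def is_cache_acceleration_io(io_duration, io_size, iops):
    """Step S620: the IO counts as a cache disk acceleration IO if it took
    longer than the minimum allowable time."""
    return io_duration > duration_threshold(io_size, iops)


# Example: an IO counted as 8 reference operations on a disk rated at 200 IOPS.
threshold_s = duration_threshold(8, 200)
slow_io = is_cache_acceleration_io(0.05, 8, 200)
fast_io = is_cache_acceleration_io(0.03, 8, 200)
```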
  • the SMART information includes at least one of the following:
  • The SMART information may include any available attributes, such as the disk health score (SMART Health Status), accumulated start-stop cycles (Accumulated start-stop cycles), accumulated load-unload cycles (Accumulated load-unload cycles), the number of grown bad sectors (Elements in grown defective list), the count of non-medium errors (Non-medium error count), and the number of uncorrectable errors, where the number of uncorrectable errors can include the total uncorrected read errors (Total uncorrected read errors) and the total uncorrected write errors (Total uncorrected write errors). Those skilled in the art may add or remove specific disk parameters according to actual needs, which is not limited here.
  • The rate of change and the incremental value of each disk parameter can be used.
  • The incremental value may be taken as an absolute value, which characterizes the variation range of the disk parameter. The greater the variation range of the disk parameters, the greater the risk of disk failure.
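The rate of change and the absolute increment of each SMART parameter can be derived from two successive collections, as in the sketch below. The attribute keys and the function name are hypothetical, and defining the rate of change as the increment per day is an assumption; the application does not give a formula.

```python
def smart_change_features(previous, current, interval_days):
    """Derive per-parameter change features from two SMART snapshots.

    previous, current: dicts mapping a parameter name to its numeric value.
    interval_days: time between the two collections, in days.
    Assumption: rate of change is defined here as increment per day.
    """
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    features = {}
    for name, cur in current.items():
        if name not in previous:
            continue  # no earlier value to compare against
        delta = cur - previous[name]
        features[name + "_abs_increment"] = abs(delta)
        features[name + "_change_rate"] = delta / interval_days
    return features


# Hypothetical snapshots taken two days apart.
previous = {"grown_defects": 10, "non_medium_errors": 3}
current = {"grown_defects": 16, "non_medium_errors": 3}
features = smart_change_features(previous, current, interval_days=2)
```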
  • an embodiment of the present application further provides a prediction model training method, including but not limited to step S710 and step S720.
  • Step S710: obtain the prediction training sample set of the training sample disk, where the prediction training sample set includes training sample IO information of several training sample IOs and training sample SMART information corresponding to the training sample IOs, and where the prediction training sample set is collected in a cache disk acceleration scenario of the training sample disk.
  • The prediction training sample set can be obtained from the cache disk acceleration scenario described in the embodiment shown in FIG. 1 by means of an IOPS performance model.
  • The IOPS performance model can be obtained by manually testing the IOPS performance of different large-block IOs at different queue depths. The IOPS performance model can be used to characterize the read and write capability of the disk. Therefore, for a given number of IOs, the estimated allowable time, that is, the above-mentioned duration threshold, can be calculated with the IOPS performance model. If the actual processing duration is greater than the duration threshold, the IO can be considered to come from a cache disk acceleration scenario and can be treated as a valid sample. The prediction training sample set can be collected periodically, for example once a day, and the specific period can be selected according to actual needs.
  • Step S720: train the prediction model according to the prediction training sample set.
  • The prediction model may be trained once a day, or the schedule may be adjusted according to actual needs, which is not limited herein. When the training sample set includes several sample subsets, training can be carried out per sample subset, with each subset used to train for its corresponding period, so that the prediction model can perform failure prediction for different periods.
  • The prediction model may use a common model framework, such as the LightGBM framework.
  • the framework parameters can be set according to the following table 1:
  • The training sample IO information includes the training sample IO duration and the training sample IO size.
  • Step S710 in the embodiment shown in FIG. 7 also includes, but is not limited to, the following steps:
  • Step S810: obtain the IOPS of the training sample disk, and determine the training sample duration threshold according to the IOPS of the training sample disk and the training sample IO size;
  • Step S820: when the training sample IO duration is greater than the training sample duration threshold, determine that the training sample disk is in a cache disk acceleration scenario.
  • step S810 in the embodiment shown in FIG. 8 further includes but is not limited to the following steps:
  • Step S910: determine all IOs of the training sample disk in the cache disk acceleration scenario to be candidate IOs;
  • Step S920: determine the training sample IOs from the candidate IOs according to the preset condition, and determine the IO information of the training sample IOs to be the training sample IO information;
  • Step S930: obtain the training sample SMART information corresponding to the training sample IOs from the SMART information of the training sample disk;
  • Step S940: preprocess the training sample IO information and the training sample SMART information, and generate the prediction training sample set according to the preprocessed training sample IO information and training sample SMART information.
  • The IOs in the cache disk acceleration scenario are first determined to be candidate IOs. The training sample IOs are then screened from the candidate IOs according to the preset conditions.
  • The preprocessing can check the validity of the training samples and check whether they meet the time requirement; corresponding operations can also be added or removed according to actual needs, such as handling the imbalance between positive and negative samples, which is not described further here. Checking the validity of the training samples mainly ensures that the candidate IOs obtained are continuous and avoids using IOs whose acquisition was interrupted as training samples. For example, if the disk is powered off during the execution of an IO, that IO is discontinuous and its IO information deviates substantially, so it cannot be used for training; samples of this type can be removed by preprocessing. Whether the training samples meet the time requirement can be determined according to the set training period. For example, if the longest training period is four weeks, training samples older than four weeks are removed to ensure the timeliness of the data.
  • the IO information includes time delay information, status information, and IO time information.
  • the proportion of each delay segment is appropriately weighted, so that higher delay segments carry more weight, highlighting the health threat caused by high delay; the IO error rate, that is, the proportion of IOs in each delay segment whose status information is IO error relative to the overall IO; and the average value of the first N delays in each delay segment in descending order, where the value of N can be selected according to actual needs. Most of the SMART information is statistical data, so the rate of change and the absolute value of each statistic in the SMART information can be obtained to achieve feature expansion, which is not described further here.
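The three delay-based features mentioned above (weighted segment proportion, IO error rate per segment, and the mean of the top N delays per segment) can be sketched as follows. The segment boundaries in `SEGMENT_EDGES_MS`, the weights in `SEGMENT_WEIGHTS`, and the default `top_n` are hypothetical values chosen only to make the example concrete; the application does not specify them.

```python
from bisect import bisect_right

# Hypothetical delay segments: below 10 ms, 10 ms up to 100 ms, 100 ms and above.
SEGMENT_EDGES_MS = [10.0, 100.0]
# Hypothetical weights: higher delay segments count for more.
SEGMENT_WEIGHTS = [1.0, 2.0, 4.0]


def delay_segment_features(ios, top_n=2):
    """Compute per-segment features from a list of (delay_ms, is_error) pairs."""
    total = len(ios)
    if total == 0:
        return []
    segments = [[] for _ in SEGMENT_WEIGHTS]
    for delay_ms, is_error in ios:
        # bisect_right maps a delay to the index of the segment it falls in.
        segments[bisect_right(SEGMENT_EDGES_MS, delay_ms)].append((delay_ms, is_error))
    features = []
    for weight, seg in zip(SEGMENT_WEIGHTS, segments):
        top = sorted((d for d, _ in seg), reverse=True)[:top_n]
        features.append({
            # Share of all IOs that fall in this segment, scaled by its weight.
            "weighted_proportion": weight * len(seg) / total,
            # Error IOs in this segment as a share of all IOs.
            "error_rate": sum(1 for _, err in seg if err) / total,
            # Mean of the N largest delays in this segment.
            "top_n_mean_delay": sum(top) / len(top) if top else 0.0,
        })
    return features


ios = [(5.0, False), (20.0, False), (50.0, True), (150.0, True)]
segment_features = delay_segment_features(ios)
```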
  • The training sample IO information further includes status information, delay information, and IO time information.
  • The preset condition includes at least one of the following:
  • the status information is an error status that indicates an IO error;
  • the IO size is greater than the preset IO size threshold;
  • the IO time information conforms to the preset sample collection period;
  • the delay information satisfies a preset delay distribution interval.
  • The preset condition may be evaluated, and the data filtered, by the IO module shown in FIG. 2; for example, a statistics list may be determined according to the collection period.
  • After the IO module obtains a candidate IO, it first checks whether the status information of the candidate IO is an IO error. If so, the candidate IO is added directly to the statistics list; collecting IOs whose status information is an IO error and using them for training allows the prediction model to predict more accurately the probability that the disk will fail. If the status information of the candidate IO is normal, the IO module then checks whether the IO size satisfies the IO size threshold. Based on the analysis in the above embodiment, the delay characteristics of large-block IO can be used to predict disk failure. Therefore, this embodiment may count only IOs of a specific size, for example only IOs in the range of 128K to 512K.
  • the candidate IO is added to the candidate list; otherwise the number of IOs is checked, to avoid collecting too few IOs.
  • the candidate IOs in the statistics list and the candidate list can be determined as training sample IOs without collecting the current IO.
  • the collection time of the candidate IO does not fall within the collection period, and since candidate IOs are collected in chronological order, it can be determined that collection has expired at this point; the candidate IO can be discarded and collection stopped. If the collection period has not been exceeded, the candidate IO is a valid IO and is added to the statistics list, and the candidate IOs in the statistics list and the candidate list are determined as training sample IOs.
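The screening flow can be approximated by the simplified sketch below. It is not the exact procedure of the application: the separate statistics and candidate lists are merged into one result list, the IO-count check and the delay distribution condition are omitted, and the `CandidateIO` fields, the size limits, and the seven-day collection period are hypothetical values.

```python
from dataclasses import dataclass


@dataclass
class CandidateIO:
    """Hypothetical candidate IO record."""
    is_error: bool       # status information indicates an IO error
    size_bytes: int      # IO size
    collected_day: int   # days since collection started


MIN_SIZE = 128 * 1024        # example lower bound from the text (128K)
MAX_SIZE = 512 * 1024        # example upper bound from the text (512K)
COLLECTION_PERIOD_DAYS = 7   # hypothetical collection period


def select_training_ios(candidates):
    """Screen candidate IOs, assumed to be in chronological order."""
    selected = []
    for io in candidates:
        if io.is_error:
            # Error IOs are kept directly.
            selected.append(io)
            continue
        if not (MIN_SIZE <= io.size_bytes <= MAX_SIZE):
            # Only large-block IOs in the configured size range are counted.
            continue
        if io.collected_day >= COLLECTION_PERIOD_DAYS:
            # Collection is chronological, so everything after this has expired too.
            break
        selected.append(io)
    return selected


candidates = [
    CandidateIO(False, 256 * 1024, 0),
    CandidateIO(False, 4 * 1024, 1),
    CandidateIO(True, 4 * 1024, 2),
    CandidateIO(False, 512 * 1024, 8),
    CandidateIO(False, 256 * 1024, 9),
]
training_ios = select_training_ios(candidates)
```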
  • step S720 in the embodiment shown in FIG. 7 further includes but is not limited to the following steps:
  • Step S1110 obtaining a preset training period, and determining a period sample set corresponding to the training period according to the IO time information of the training sample IO;
  • Step S1120 train the prediction model according to the periodic sample set.
  • the training period can be selected according to actual needs; for example, relative to the current time, the training samples of the preceding week, the preceding two weeks and the preceding four weeks are obtained as period sample sets, so that the prediction model can perform disk failure prediction for different prediction periods.
  • the prediction data set can be collected according to the same period when performing disk failure prediction, so as to obtain the prediction result in the corresponding period.
  • the training sample SMART information includes at least one of the following:
  • step S720 in the embodiment shown in FIG. 7 further includes but is not limited to the following steps:
  • Step S1210 dividing a training sample set and a test sample set from the prediction training sample set according to a preset ratio
  • Step S1220 Train the prediction model according to the training sample set, and perform a verification test on the trained prediction model according to the test sample set.
  • the preset ratio may be any value, which can be adjusted according to actual requirements, for example, the training sample set and the test sample set are divided according to the ratio of 8:2.
  • the feature expansion operation in the foregoing embodiment may be performed before or after segmentation of the prediction training sample set, which is not limited in this embodiment.
  • thresholds can be set for judgment, such as a false discovery rate (False Discovery Rate, FDR) or a false acceptance rate (False Accept Rate, FAR); the specific threshold setting standard can be adjusted according to actual needs, which is not limited here.
  • an embodiment of the present application further provides an electronic device, the electronic device 1300 includes: a memory 1310 , a processor 1320 , and a computer program stored on the memory 1310 and executable on the processor 1320 .
  • the processor 1320 and the memory 1310 may be connected by a bus or otherwise.
  • the non-transitory software programs and instructions required to implement the disk failure prediction method of the above embodiment are stored in the memory 1310 and, when executed by the processor 1320, perform the disk failure prediction method applied to the electronic device 1300 in the above embodiment, for example the above-described method steps S110 to S120 in FIG. 1, method steps S310 to S330 in FIG. 3, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method steps S810 to S820 in FIG. 8, method steps S910 to S940 in FIG. 9, method steps S1110 to S1120 in FIG. 11, and method steps S1210 to S1220 in FIG. 12.
  • the embodiment of the present application includes: acquiring a prediction data set of a disk to be predicted, the prediction data set including IO information of a prediction sample IO and SMART information corresponding to the prediction sample IO, wherein the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted; and inputting the prediction data set into a pre-trained prediction model to obtain the prediction result of the disk to be predicted.
  • the disk failure prediction can be performed for all types of disks in combination with IO information and SMART information, which effectively reduces the risk of data loss.
  • an embodiment of the present application also provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are executed by a processor or controller, for example, by the above-mentioned Executed by a processor in the embodiment of the electronic device, the above-mentioned processor can execute the disk failure prediction method applied to the electronic device in the above-mentioned embodiment, for example, execute the above-described method steps S110 to S120 in FIG. 1 .
  • step S720 the method steps S810 to S820 in FIG. 8, the method steps S910 to S940 in FIG. 9, the method steps S1110 to S1120 in FIG. 11, and the method steps S1210 to S1220 in FIG. 12.
  • a processor such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit.
  • Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery media, as is well known to those of ordinary skill in the art.
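
The division of the prediction training sample set by a preset ratio (steps S1210 to S1220 in the list above, with 8:2 as the example ratio) can be sketched as follows. This is only an illustration: the function name `split_samples`, the random shuffling and the `seed` parameter are assumptions and are not taken from the application, which does not say how samples are assigned to each set.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Split samples into a training set and a test set by a preset ratio.

    Hypothetical helper: the shuffle and the seed are illustrative choices.
    """
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train_set, test_set = split_samples(list(range(10)), train_ratio=0.8)
    print(len(train_set), len(test_set))
```

The test set would then be used only for the verification test of the trained model, for example to check the thresholds mentioned in the list above.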

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A magnetic disk failure prediction method, a prediction model training method, and an electronic device. The magnetic disk failure prediction method comprises: acquiring a prediction data set of a magnetic disk to be predicted, wherein the prediction data set comprises IO information of a prediction sample IO and SMART information corresponding to the prediction sample IO, and the prediction data set is collected from a cache disk acceleration scenario of said magnetic disk (S110); and inputting the prediction data set into a pre-trained prediction model, so as to obtain a prediction result of said magnetic disk (S120).

Description

Disk failure prediction method, prediction model training method, and electronic device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on Chinese patent application No. 202011394121.0, filed on December 3, 2020, and claims priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to, but is not limited to, the field of data storage, and in particular to a disk failure prediction method, a prediction model training method, and an electronic device.
Background
With the development of network and communication technology, the amount of data stored in server data centers is increasing rapidly. Disks are important hardware devices for data storage, and larger data centers usually contain a large number of disks. Disks have a limited service life, and toward the end of that life the probability of disk damage increases significantly. To address this problem, replication or erasure coding is usually used for data redundancy, but this only prevents data loss caused by the failure of a single disk; when multiple disks fail at the same time, there is still a risk of data loss.
For this reason, it is usually necessary to predict disk failure while the disk is running and to replace the disk promptly when a high failure risk is detected, thereby reducing the risk of data loss. A common practice is to use a trained prediction model for failure prediction. However, the training data used by existing prediction models is usually the Self-Monitoring Analysis and Reporting Technology (SMART) information of the disk, and SMART information is only suitable for Serial Advanced Technology Attachment (SATA) mechanical disks, which have many types of disk parameters; for Serial Attached SCSI (SAS, where SCSI is the Small Computer System Interface) disks, which have fewer disk parameters, accurate predictions cannot be obtained.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
Embodiments of the present application provide a disk failure prediction method, a prediction model training method, and an electronic device.
In a first aspect, an embodiment of the present application provides a disk failure prediction method, including: acquiring a prediction data set of a disk to be predicted, the prediction data set including IO information of a prediction sample input/output (Input Output, IO) and SMART information corresponding to the prediction sample IO, wherein the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted; and inputting the prediction data set into a pre-trained prediction model to obtain a prediction result of the disk to be predicted.
In a second aspect, an embodiment of the present application further provides a prediction model training method, including: acquiring a prediction training sample set of a training sample disk, the prediction training sample set including training sample IO information of a training sample IO and training sample SMART information corresponding to the training sample IO, wherein the prediction training sample set is collected in a cache disk acceleration scenario of the training sample disk; and training the prediction model according to the prediction training sample set.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the disk failure prediction method of the first aspect or the prediction model training method of the second aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the disk failure prediction method of the first aspect or the prediction model training method of the second aspect.
Other features and advantages of the present application will be set forth in the description that follows and will in part become apparent from the description, or may be learned by practicing the present application. The objectives and other advantages of the present application may be realized and attained by the structures particularly pointed out in the description, the claims and the drawings.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solutions of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solutions of the present application and do not limit them.
FIG. 1 is a flowchart of a disk failure prediction method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a module framework provided by another embodiment of the present application;
FIG. 3 is a flowchart of determining a prediction result according to a prediction period, provided by another embodiment of the present application;
FIG. 4 is a flowchart of determining a prediction result according to a period failure probability, provided by another embodiment of the present application;
FIG. 5 is a flowchart of determining a prediction result according to the number of times a period is determined to be a high-risk period, provided by another embodiment of the present application;
FIG. 6 is a flowchart of determining that a disk to be predicted is in a cache disk acceleration scenario, provided by another embodiment of the present application;
FIG. 7 is a flowchart of a prediction model training method provided by another embodiment of the present application;
FIG. 8 is a flowchart of determining that a training sample disk is in a cache disk acceleration scenario, provided by another embodiment of the present application;
FIG. 9 is a flowchart of acquiring a prediction training sample set, provided by another embodiment of the present application;
FIG. 10 is a flowchart of determining training sample IO information according to preset conditions, provided by another embodiment of the present application;
FIG. 11 is a flowchart of training a prediction model according to a training period, provided by another embodiment of the present application;
FIG. 12 is a flowchart of dividing the prediction training sample set into a training sample set and a test sample set, provided by another embodiment of the present application;
FIG. 13 is a schematic structural diagram of an electronic device provided by another embodiment of the present application.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that of the device, or in an order different from that of the flowcharts. The terms "first", "second" and the like in the specification, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The present application provides a disk failure prediction method, a prediction model training method, and an electronic device. The disk failure prediction method includes: acquiring a prediction data set of a disk to be predicted, the prediction data set including IO information of a prediction sample IO and SMART information corresponding to the prediction sample IO, wherein the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted; and inputting the prediction data set into a pre-trained prediction model to obtain a prediction result of the disk to be predicted. According to the solution provided by the embodiments of the present application, disk failure prediction can be performed for all types of disks by combining IO information and SMART information, which effectively reduces the risk of data loss.
The embodiments of the present application are further described below with reference to the accompanying drawings.
As shown in FIG. 1, which is a flowchart of a disk failure prediction method provided by an embodiment of the present application, the disk failure prediction method includes, but is not limited to, step S110 and step S120.
Step S110: acquire a prediction data set of a disk to be predicted, the prediction data set including IO information of a prediction sample IO and SMART information corresponding to the prediction sample IO, wherein the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted.
It should be noted that the IO information of each prediction sample IO of the disk to be predicted includes multiple attributes such as IO delay, IO size and IO status information. Using IO information as an input of the prediction model can therefore effectively alleviate the problem of insufficient attributes.
It should be noted that, in this embodiment, the minimum allowable time of the prediction sample IOs may be determined from the disk's number of read/write operations per second (Input Output per second, IOPS), and this minimum allowable time is taken as a duration threshold. When the sum of the durations of several IOs is greater than the duration threshold, it can be determined that the prediction sample IOs are all large-block IOs, that is, the disk working scenario corresponding to the IOs is a cache disk acceleration scenario. It is worth noting that, in a cache disk acceleration scenario, the IO issued to the disk tends to be large-block reads and writes, with few small-block IOs; at the same time, the maximum IO at the block layer is usually 512K, so the range of IO sizes is relatively narrow. This provides a basis for tracking the IO size of the disk. Those skilled in the art will understand that, in a storage system, to guarantee storage service quality, front-end applications usually perform service capability matching, and the storage side sets front-end Quality of Service (QOS), back-end QOS, and so on. These QOS settings effectively prevent IO bursts in the vast majority of cases, avoiding an excessive IO queue depth and a disk load too heavy to provide stable service. In summary, in a cache disk acceleration scenario, collecting statistics on the delay of large-block IO under a given load is clearly meaningful and provides rich indications of disk status at the application level; therefore, the IO information in a cache disk acceleration scenario can be used for failure prediction.
It is worth noting that the prediction data set of this embodiment includes both IO information and the corresponding SMART information. The corresponding SMART information may be the SMART information at the time the prediction sample IO is executed, or it may be SMART information collected according to a collection period, for example once a day. The specific collection method and period can be adjusted according to the actual situation, as long as the SMART information and the IO information are correlated to some degree.
It should be noted that, because different types of disks have different physical properties, their failure criteria differ. When training the prediction model, disk type information can be used as one of the selected features, so that the prediction result produced by the prediction model can represent the failure risk of that type of disk. On this basis, disk type information can also be acquired as an input of the prediction model, so that the prediction model can perform failure prediction for different types of disks. It can be understood that the disk type information may include the disk manufacturer, disk model, disk capacity, disk serial number and rotation speed, which is not limited in this embodiment.
Step S120: input the prediction data set into the pre-trained prediction model to obtain the prediction result of the disk to be predicted.
In an embodiment, the specific prediction timing can be set according to operational requirements: prediction may be performed once a day to reduce the risk of data loss, or once after each collection of IO information and SMART information, adjusted according to actual needs.
It is worth noting that the IO path is affected not only by the disk itself but also by the controller, expansion cards, cables and even the operating system. IO information is ultimately aggregate information; it needs to be combined with information about the disk itself to reach an overall judgment and cannot be used for prediction on its own, otherwise false alarms are likely, for example a problem with a physical port of an expansion card being reported as a disk failure. SMART information, in contrast, represents the state of the disk's parameters. Combining the features of the IO information with the features of the SMART information therefore allows the failure risk of the disk to be predicted more accurately.
It can be understood that, when a prediction model is used to obtain the prediction result, the prediction model can be trained in advance on IO information and SMART information. For example, the delay information in the IO information, the rate of change of each parameter in the SMART information, and the absolute value of the increase in the rate of change of each parameter in the SMART information can be used as training features. The features are labeled and then input to the prediction module for training, so that the prediction model can derive the prediction result of the disk to be predicted from these features.
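
The SMART-derived features mentioned above (the rate of change of each parameter and the absolute value of the increase in that rate) could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `smart_change_features`, the representation of SMART snapshots as dictionaries collected at a fixed interval, and the reading of "increase in the rate of change" as the difference between the two most recent rates are all illustrative and are not specified in the application.

```python
from typing import Dict, List


def smart_change_features(history: List[Dict[str, float]]) -> Dict[str, float]:
    """Derive change features from chronologically ordered SMART snapshots.

    Assumes (hypothetically) that snapshots are taken at a fixed interval,
    so a rate of change is the difference between consecutive values.
    """
    if len(history) < 2:
        raise ValueError("at least two SMART snapshots are required")
    features: Dict[str, float] = {}
    for name in history[-1]:
        values = [snapshot[name] for snapshot in history]
        # Change per collection interval between consecutive snapshots.
        rates = [later - earlier for earlier, later in zip(values, values[1:])]
        features[f"{name}_rate"] = rates[-1]
        # With only one rate available, the previous rate is taken as zero.
        previous_rate = rates[-2] if len(rates) >= 2 else 0.0
        features[f"{name}_rate_increase_abs"] = abs(rates[-1] - previous_rate)
    return features


if __name__ == "__main__":
    snapshots = [{"reallocated": 0.0}, {"reallocated": 2.0}, {"reallocated": 7.0}]
    print(smart_change_features(snapshots))
```

Features of this kind would be joined with the IO delay features and the failure label before training; the attribute name `reallocated` is only a placeholder.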
In an embodiment, the prediction result may take any form, for example the current failure risk probability of the disk to be predicted, or a specific risk value. The corresponding prediction result may also be determined for a specific collection period; for example, if the prediction data set contains data from the past week, the prediction result is the probability of failure within the coming week. Any form that reflects the failure risk of the disk to be predicted is acceptable, which is not limited here.
In addition, in an embodiment, the system architecture to which the disk failure prediction method of the present application is applied may be as shown in FIG. 2. It includes a prediction center and several agent nodes, each of which may take the form of an electronic device or a server, which is not limited here. The prediction center may include an alarm management module, a prediction module and a disk information management module. The alarm management module is configured to raise an alarm when a disk is detected to be at high risk; the prediction module is configured to predict the failure risk of the disk to be predicted according to the prediction data set; and the disk information management module is configured to receive and manage the IO information and SMART information sent by the agent nodes and to form the prediction data set. An agent node includes an IO module and a SMART module. The IO module is configured to acquire the IO information of the agent node's disks and to filter the IO information according to preset rules, so that the filtered IO information can be used to form the prediction data set; the SMART module is configured to collect the SMART information of the agent node's disks. It should be noted that the present application does not involve improvements to the specific structure of the agent nodes or the prediction center, only the processing of the collected data, so no further details are given here.
In addition, referring to FIG. 3, in an embodiment, the IO information further includes IO time information, and step S120 in the embodiment shown in FIG. 1 further includes, but is not limited to, the following steps:
Step S310: determine a prediction period, and determine a period data set from the prediction data set according to the prediction period and the IO time information;
Step S320: obtain, according to the period data set and the prediction model, the period failure probability of the disk to be predicted in the prediction period;
Step S330: determine the prediction result of the disk to be predicted according to the period failure probability.
In an embodiment, the prediction period can be selected according to actual needs. For example, to assess the near-term failure risk of the disk to be predicted, the prediction period may be set to several days, one week or two weeks; to assess the failure risk further into the future, the prediction period may be set to one month. It can be adjusted according to actual needs. It can be understood that any number of prediction periods may be used; for example, data for one week, two weeks and four weeks may be acquired at the same time and a prediction result obtained for each prediction period, making the failure risk prediction for the disk more accurate.
In an embodiment, as shown in FIG. 2, the collected prediction data set is stored in the disk information management module. To reduce storage pressure, life cycle management may also be applied to the collected data. For example, if the prediction period is set to one week, life cycle management can ensure that the IO information and SMART information in the prediction data set held by the disk information management module were all collected within the past week. The specific life cycle management method is not an improvement made by this embodiment and is not described further here.
In an embodiment, the IO time information may be the specific time at which the disk to be predicted executed the IO, for example the time at which the IO started or the time at which it completed. The specific selection criterion can be adjusted according to the actual situation and is not limited here.
In an embodiment, the period failure probability is the failure probability of the disk to be predicted over a span equal to the prediction period. For example, if the prediction period is one week, the acquired period data set consists of the IO information and SMART information collected during the past week; failure prediction is performed on this information, and the resulting period failure probability is the probability that the disk to be predicted fails within the coming week.
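
Step S310, selecting a period data set from the prediction data set by IO time information, might look like the sketch below for the one-, two- and four-week periods mentioned earlier. The `IoRecord` type, its field names and the function `period_data_sets` are hypothetical; the application does not define a record layout, and the choice of which timestamp to use (start or completion of the IO) is left open there as well.

```python
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple, Sequence


class IoRecord(NamedTuple):
    """Hypothetical record of one prediction sample IO."""
    io_time: datetime   # IO time information (start or completion time)
    latency_ms: float   # IO delay
    size_bytes: int     # IO size
    ok: bool            # IO status information


def period_data_sets(
    records: Sequence[IoRecord],
    now: datetime,
    period_days: Sequence[int] = (7, 14, 28),
) -> Dict[int, List[IoRecord]]:
    """Return, for each prediction period, the records that fall inside it."""
    result: Dict[int, List[IoRecord]] = {}
    for days in period_days:
        start = now - timedelta(days=days)
        # Keep only the records whose IO time lies within the look-back window.
        result[days] = [r for r in records if start <= r.io_time <= now]
    return result


if __name__ == "__main__":
    now = datetime(2021, 12, 1)
    records = [
        IoRecord(now - timedelta(days=d), 12.5, 256 * 1024, True)
        for d in (1, 10, 20, 40)
    ]
    for days, subset in period_data_sets(records, now).items():
        print(days, len(subset))
```

Each period data set would then be passed to the prediction model separately (step S320) to obtain one period failure probability per prediction period.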
In addition, referring to FIG. 4, in an embodiment, step S330 in the embodiment shown in FIG. 3 further includes, but is not limited to, the following steps:
Step S410: when the periodic failure probability is greater than a preset probability threshold corresponding to the prediction period, determine the prediction result to be high risk;
Step S420: when the periodic failure probability is less than or equal to the probability threshold, determine the prediction result to be low risk.
It should be noted that the probability threshold can be determined according to actual risk management requirements. For example, with a probability threshold of 80%, a periodic failure probability greater than 80% indicates a high failure risk, and one less than or equal to 80% indicates a low failure risk. This embodiment does not limit the specific probability threshold. Several probability thresholds corresponding to several risk levels may also be set according to actual requirements, which is not described further here.
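The threshold rule of steps S410 and S420 can be sketched in a few lines of Python. This is only an illustration: the function names, the label strings, and the multi-level variant are assumptions introduced here, and the 0.8 default simply mirrors the 80% example in the text.

```python
def classify_risk(failure_probability: float, threshold: float = 0.8) -> str:
    """Steps S410/S420: strictly greater than the threshold is high risk."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must lie in [0, 1]")
    return "high" if failure_probability > threshold else "low"


def classify_risk_levels(failure_probability, levels):
    """Hypothetical multi-level variant.

    `levels` is a list of (threshold, label) pairs. The label of the highest
    threshold that the probability exceeds is returned, or "low" when the
    probability exceeds none of them.
    """
    result = "low"
    for threshold, label in sorted(levels):
        if failure_probability > threshold:
            result = label
    return result
```

Note that a probability exactly equal to the threshold is classified as low risk, matching the "less than or equal to" wording of step S420.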
It should be noted that using a risk level as the prediction result reflects the failure probability of the probed disk, so that when a high-risk prediction result is detected, a disk replacement can be scheduled in advance. In particular, this reduces the chance of multiple disks failing at the same time and lowers the risk of data loss. In addition, by setting a probability threshold corresponding to low risk, no alarm is generated when the periodic failure probability is below that threshold, which effectively reduces the false alarm rate.
It can be understood that the periodic failure probability is the prediction result for its own prediction period and does not affect the prediction results of other prediction periods. For example, the periodic failure probability for a one-week prediction period and that for a two-week prediction period are mutually independent parameters: the one-week periodic failure probability is not considered when determining the prediction result for the two-week prediction period. This is not described further here.
In addition, referring to FIG. 5, in an embodiment, step S410 in the embodiment shown in FIG. 4 further includes, but is not limited to, the following steps:
Step S510: when the periodic failure probability is greater than the probability threshold, determine the prediction period corresponding to the periodic failure probability to be a high-risk period;
Step S520: when the number of times the prediction period has been determined to be a high-risk period is greater than a preset alarm count threshold, determine the prediction result to be high risk.
In an embodiment, using an alarm count threshold effectively reduces the number of false alarms. In actual prediction, because prediction is performed per period, abnormal data on a single day may well yield a high-risk prediction result. Predicting multiple times therefore effectively reduces the deviation caused by occasional anomalies. The specific alarm count threshold can be adjusted according to actual requirements and is not limited here.
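A minimal sketch of steps S510 and S520 follows. It assumes the successive periodic failure probabilities for one prediction period are available as a list; the function name, the default thresholds, and the "low" fallback label are illustrative assumptions, since the text only states when the result becomes high risk.

```python
def confirm_high_risk(period_probabilities,
                      probability_threshold=0.8,
                      alarm_count_threshold=2):
    """Report high risk only after repeated high-risk periods.

    S510: a period counts as high risk when its probability exceeds the
    probability threshold.
    S520: the result is high risk when that count exceeds the alarm count
    threshold. Returning "low" otherwise is an assumption of this sketch.
    """
    high_risk_periods = sum(
        1 for p in period_probabilities if p > probability_threshold
    )
    return "high" if high_risk_periods > alarm_count_threshold else "low"
```

With an alarm count threshold of 2, a single anomalous reading such as `[0.9, 0.2, 0.3]` does not raise an alarm, while three high-risk periods in a row do.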
In an embodiment, when the prediction result is determined to be high risk, alarm information may also be generated from the basic information of the disk to be predicted. The basic information may be the disk model, the installation location, and so on, and is not limited here. It can be understood that the alarm information may be generated by the alarm management module shown in FIG. 2, for example by pushing alarm information that carries the basic information of the disk to a back-end management system, so that maintenance personnel can service the disk promptly and accurately.
In addition, referring to FIG. 6, in an embodiment, the IO information further includes an IO duration and an IO size, and whether the disk to be predicted is in the cache disk acceleration scenario is determined by the following steps:
Step S610: obtain the IOPS of the disk to be predicted, and determine a duration threshold according to the IOPS of the disk to be predicted and the IO size;
Step S620: when the IO duration is greater than the duration threshold, determine that the disk to be predicted is in the cache disk acceleration scenario.
It should be noted that the IOPS of the disk to be predicted can be obtained in any manner, for example by reading the IOPS performance parameter of the disk, or by actually running several IO tests against the disk. The specific manner can be chosen according to actual requirements.
It can be understood that the duration threshold can be obtained by dividing the IO size by the IOPS. Because IOPS characterizes the read/write capability of a disk, the duration threshold can represent the minimum allowable time the disk needs to process an IO of a given size. When the IO duration is greater than this minimum allowable time, it can be determined that the IO was executed in the cache disk acceleration scenario.
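The check in steps S610 and S620 can be sketched as follows. The formula is taken as stated in the text (IO size divided by IOPS); the function names are hypothetical, and units are left abstract because the text does not specify them.

```python
def duration_threshold(io_size: float, iops: float) -> float:
    """S610: duration threshold = IO size / IOPS, as stated in the text."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return io_size / iops


def in_cache_acceleration_scenario(io_duration: float,
                                   io_size: float,
                                   iops: float) -> bool:
    """S620: the IO belongs to the cache disk acceleration scenario when its
    duration is strictly greater than the duration threshold."""
    return io_duration > duration_threshold(io_size, iops)
```

The caller is responsible for expressing the IO duration in the same unit that the quotient produces.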
值得注意的是,可以通过一个IOIt is worth noting that it is possible to pass an IO
In addition, in an embodiment, the SMART information includes at least one of the following:
Accumulated start-stop cycles;
Accumulated load-unload cycles;
Number of grown defects;
Non-medium error count;
Number of uncorrected errors.
In an embodiment, the SMART information may include any obtainable attribute, for example the disk health score (SMART Health Status), the accumulated start-stop count (Accumulated start-stop cycles), the accumulated load-unload count (Accumulated load-unload cycles), the number of grown defects (Elements in grown defect list), the non-medium error count (Non-medium error count), and the number of uncorrected errors, where the number of uncorrected errors may include the number of uncorrected read errors (Total uncorrected read errors) and the number of uncorrected write errors (Total uncorrected write errors). Those skilled in the art may add or remove specific disk parameters according to actual requirements, which is not limited here.
It can be understood that, based on the above disk parameters, the rate of change and the increment of each disk parameter can be used to characterize the failure risk of the disk. The rate of change of a disk parameter may be a parameter that characterizes how quickly its value changes, and the increment of a disk parameter may be the absolute value of the increment, as long as it characterizes the magnitude of change of the disk parameter. The larger the magnitude of change of a disk parameter, the greater the risk of disk failure.
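As a rough illustration of the rate-of-change and increment features, the sketch below derives both from two SMART snapshots. The snapshot layout (a dict of attribute name to counter value), the per-day rate, and the feature naming are assumptions made for this example.

```python
def smart_delta_features(first: dict, last: dict, elapsed_days: float) -> dict:
    """Derive increment and rate-of-change features from two SMART snapshots.

    `first` and `last` map attribute names to counter values at the start and
    end of the window. The increment is taken as an absolute value, as the
    text suggests; dividing by elapsed days is one possible rate definition.
    """
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    features = {}
    for name, start_value in first.items():
        increment = abs(last[name] - start_value)
        features[name + "_increment"] = increment
        features[name + "_rate"] = increment / elapsed_days
    return features
```

For instance, a grown defect count that rises from 2 to 8 over three days yields an increment of 6 and a rate of 2 per day.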
In addition, referring to FIG. 7, an embodiment of the present application further provides a prediction model training method, including but not limited to step S710 and step S720.
Step S710: obtain a prediction training sample set of a training sample disk, the prediction training sample set including training sample IO information of several training sample IOs and training sample SMART information corresponding to the training sample IOs, where the prediction training sample set is collected in the cache disk acceleration scenario of the training sample disk.
In an embodiment, the prediction training sample set can be obtained from the cache disk acceleration scenario described in the embodiment shown in FIG. 1 by means of an IOPS performance model. The IOPS performance model can be derived by manually testing the IOPS performance of different large-block IOs at different queue depths. It can be understood that the IOPS performance model characterizes the read/write capability of the disk. Therefore, for a given number of IOs, the IOPS performance model can be used to calculate an estimated allowable time, namely the duration threshold described above. When the actual processing duration of those IOs is greater than the duration threshold, the IOs can be considered to come from the cache disk acceleration scenario and can be treated as valid samples. It can be understood that the prediction training sample set may be collected periodically, for example once a day, and the specific period can be chosen according to actual requirements.
It is worth noting that, for the principle by which the prediction training samples are collected in the cache disk acceleration scenario, reference may be made to the principle described in the embodiment of FIG. 2, which is not repeated here.
Step S720: train the prediction model according to the prediction training sample set.
In an embodiment, the prediction model may be trained once a day, or the schedule may be adjusted according to actual requirements, which is not limited here. It can be understood that when the training sample set includes several sample subsets, training can be performed separately for each subset. For example, if prediction training sample sets covering one week, two weeks, and four weeks have been collected for different periods, training is performed separately for each corresponding period, so that the prediction model can perform failure prediction for different periods.
In an embodiment, the prediction model may use a common model framework, for example the LightGBM framework. It should be noted that before the prediction model is trained, the basic parameters of the model also need to be set. For example, when the model framework is the LightGBM framework mentioned above, the framework parameters can be set as shown in Table 1 below:
Parameter name         Value
Learning rate          0.35
Iteration rounds       110
Cross validation       5
Total sample number    5160
Terminal condition     10^-4
Table 1: Model framework parameter configuration
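One possible way to carry the Table 1 values into a LightGBM run is sketched below. The mapping is an assumption: the text does not say which LightGBM options the table rows correspond to. `learning_rate` is a LightGBM parameter, and `num_boost_round` and `nfold` are arguments of `lightgbm.cv`; treating "Cross validation 5" as a fold count, using a binary objective, and leaving "Terminal condition" unmapped are choices made only for this illustration. The snippet builds plain dictionaries and does not import LightGBM.

```python
# Values copied from Table 1; the key names are hypothetical.
TABLE_1 = {
    "learning_rate": 0.35,
    "iteration_rounds": 110,
    "cross_validation_folds": 5,   # assumed to mean 5-fold cross validation
    "total_sample_number": 5160,
    "terminal_condition": 1e-4,    # not mapped: no confirmed LightGBM option
}


def to_lightgbm_arguments(table: dict):
    """Translate the table into (params, cv_kwargs) for a later lightgbm.cv call."""
    params = {
        "objective": "binary",  # assumption: failure vs. no failure
        "learning_rate": table["learning_rate"],
    }
    cv_kwargs = {
        "num_boost_round": table["iteration_rounds"],
        "nfold": table["cross_validation_folds"],
    }
    return params, cv_kwargs
```

With LightGBM installed, the result could be passed along the lines of `lightgbm.cv(params, train_set, **cv_kwargs)`, where `train_set` is a `lightgbm.Dataset` built from the prediction training sample set.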
In addition, referring to FIG. 8, in an embodiment, the training sample IO information includes a training sample IO duration and a training sample IO size, and step S710 in the embodiment shown in FIG. 7 further includes, but is not limited to, the following steps:
Step S810: obtain the IOPS of the training sample disk, and determine a training sample duration threshold according to the IOPS of the training sample disk and the training sample IO size;
Step S820: when the training sample IO duration is greater than the training sample duration threshold, determine that the training sample disk is in the cache disk acceleration scenario.
It should be noted that, for the principle of determining that the training sample disk is in the cache disk acceleration scenario, reference may be made to the description of the embodiment shown in FIG. 6, which is not repeated here for brevity.
In addition, referring to FIG. 9, in an embodiment, step S810 in the embodiment shown in FIG. 8 further includes, but is not limited to, the following steps:
Step S910: determine all IOs of the training sample disk in the cache disk acceleration scenario as candidate IOs;
Determine the training sample IOs from all IOs of the training sample disk according to preset conditions, and determine the IO information of the training sample IOs as the training sample IO information;
Step S920: determine the training sample IOs from the candidate IOs according to the preset conditions, and determine the IO information of the training sample IOs as the training sample IO information;
Step S930: obtain, from the SMART information of the training sample disk, the training sample SMART information corresponding to the training sample IOs;
Step S940: preprocess the training sample IO information and the training sample SMART information, and generate the prediction training sample set from the preprocessed training sample IO information and training sample SMART information.
It should be noted that although most IOs in the cache disk acceleration scenario are large-block IOs, not all of them can be used for model training. Therefore, the IOs in the cache disk acceleration scenario are first determined as candidate IOs, and the training sample IOs are then screened from the candidate IOs according to the preset conditions.
In an embodiment, after the training sample IOs and the training sample SMART information are obtained, preprocessing of the training samples may include checking the validity of the training samples and checking whether the training samples meet the time requirement. Operations may also be added or removed according to actual requirements, for example handling the imbalance between positive and negative samples, which is not described further here. It can be understood that the validity check mainly ensures that the obtained candidate IOs are continuous and prevents IOs whose execution was interrupted from being used as training samples. For example, if the disk loses power while an IO is executing, that IO is discontinuous; its IO information deviates significantly and cannot be used for training, so samples of this type can be removed by preprocessing. It can be understood that whether a training sample meets the time requirement can be determined from the configured training period. For example, if the longest configured training period is four weeks, training samples older than four weeks are removed to keep the data current.
In an embodiment, after the training sample IOs and the training sample SMART information are obtained, feature expansion may also be performed, which helps increase the discretization of the data. For example, where the IO information includes latency information, status information, and IO time information, the following feature expansion is applied to the training sample IOs. Several latency segments are preset, for example 0 to 32 ms, 32 ms to 64 ms, 64 ms to 128 ms, 128 ms to 512 ms, and >= 512 ms; the latency segment of each training sample IO is determined from the latency information in its IO information, and the percentage share of each latency segment is determined. Disk health score evaluation: the shares of the high-latency segments are given appropriately larger weights, so that higher-latency segments carry more influence and the health threat posed by high latency is emphasized. IO error rate: for each latency segment, the proportion of IOs whose status information is an IO error relative to all IOs. The average of the N largest latencies in each latency segment, sorted in descending order, where N can be chosen according to actual requirements. It can be understood that SMART information is mostly statistical data, so the rate of change and the absolute increase of each statistic in the SMART information can be obtained to achieve feature expansion, which is not described further here.
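The IO feature expansion described above can be sketched as follows. The latency segment boundaries come from the text; everything else is an assumption for illustration, including the input layout (a list of `(latency_ms, is_error)` pairs), treating each segment as closed on the left, the feature names, and the health-score weights, which the text does not specify.

```python
from bisect import bisect_right

# Segment boundaries from the text: 0-32, 32-64, 64-128, 128-512, >=512 ms.
BUCKET_EDGES_MS = [32, 64, 128, 512]
BUCKET_NAMES = ["0-32", "32-64", "64-128", "128-512", ">=512"]


def bucket_index(latency_ms: float) -> int:
    """Index of the latency segment; a boundary value falls in the upper one."""
    return bisect_right(BUCKET_EDGES_MS, latency_ms)


def expand_io_features(ios, top_n=3, weights=(0, 1, 2, 4, 8)):
    """Build per-segment features from (latency_ms, is_error) pairs.

    `weights` are hypothetical health-score weights that grow with latency.
    """
    if not ios:
        raise ValueError("at least one IO is required")
    total = len(ios)
    buckets = [[] for _ in BUCKET_NAMES]
    errors = [0] * len(BUCKET_NAMES)
    for latency_ms, is_error in ios:
        i = bucket_index(latency_ms)
        buckets[i].append(latency_ms)
        errors[i] += int(is_error)

    features = {}
    for i, name in enumerate(BUCKET_NAMES):
        top = sorted(buckets[i], reverse=True)[:top_n]
        features["share_" + name] = len(buckets[i]) / total      # segment share
        features["error_rate_" + name] = errors[i] / total       # errors vs. all IOs
        features["top_mean_" + name] = sum(top) / len(top) if top else 0.0
    # Weighted sum of segment shares: high-latency segments count more.
    features["health_score"] = sum(
        w * features["share_" + n] for w, n in zip(weights, BUCKET_NAMES)
    )
    return features
```

In this sketch a higher `health_score` means a larger share of slow IOs; whether a real scoring scheme is oriented that way is a design choice the text leaves open.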
In addition, referring to FIG. 10, in an embodiment, the training sample IO information further includes status information, latency information, and IO time information, and the preset conditions include at least one of the following:
The status information is an error status indicating an IO error;
The IO size is greater than a preset IO size threshold;
The number of IOs already determined as training sample IOs is less than a preset quantity threshold;
The IO time information falls within a preset sample collection period;
The latency information falls within a preset latency distribution interval.
In an embodiment, the preset conditions may be evaluated and the data filtered by the IO module shown in FIG. 2; for example, a statistics list may be determined according to the collection period. For ease of description, the preset conditions of this embodiment are illustrated below with reference to FIG. 10:
After the IO module obtains a candidate IO, it first checks whether the status information of the candidate IO indicates an IO error. If so, the candidate IO is added directly to the statistics list; collecting IOs whose status information is an IO error and using them for training allows the prediction model to predict the probability of disk errors more accurately. If the status information indicates that the IO succeeded, the module checks whether the IO size satisfies the IO size threshold. Based on the analysis in the above embodiments, the latency characteristics of large-block IOs can be used to predict disk failure, so this embodiment may count only IOs of a specific size, for example only IOs in the range of 128K to 512K. If the size of the candidate IO is greater than the IO size threshold, the candidate IO is added to the pending list; otherwise, the number of IOs is checked, to avoid collecting too few IOs. When the number of candidate IOs already collected exceeds the quantity threshold, enough training sample IOs have been collected; the candidate IOs in the statistics list and the pending list can then be determined as training sample IOs, and the current candidate IO is not collected. If the quantity threshold has not been exceeded, the current candidate IO needs to be examined, for example by using its IO time information to determine whether it falls within the sample collection period. If it does not, the collection time of the candidate IO does not satisfy the collection period, and because candidate IOs are collected in chronological order, collection can be considered to have run past the period; the pending list can then be cleared and collection of candidate IOs stopped. If the collection period has not been exceeded, the candidate IO is a valid IO; it is added to the statistics list, and the candidate IOs in the statistics list and the pending list are determined as training sample IOs.
It can be understood that the preset latency distribution interval can be set according to actual requirements, for example the latency segments of the above embodiment: 0 to 32 ms, 32 ms to 64 ms, 64 ms to 128 ms, 128 ms to 512 ms, and >= 512 ms. When the latency information of a training sample IO falls within these latency distribution intervals, the IO can further be determined to be a usable training sample IO.
In addition, referring to FIG. 11, in an embodiment, step S720 in the embodiment shown in FIG. 7 further includes, but is not limited to, the following steps:
Step S1110: obtain a preset training period, and determine a period sample set corresponding to the training period according to the IO time information of the training sample IOs;
Step S1120: train the prediction model according to the period sample set.
In an embodiment, the training period can be chosen according to actual requirements. For example, relative to the current time, the training samples from the previous one week, the previous two weeks, and the previous four weeks are obtained as period sample sets, so that the prediction model can perform disk failure prediction for different prediction periods.
It can be understood that once the training period is determined, the prediction data set can be collected over the same period when performing disk failure prediction, so as to obtain the prediction result for the corresponding period.
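Step S1110 amounts to selecting samples by a time window. A minimal sketch, assuming each training sample is a `(io_time, features)` pair with `io_time` as a `datetime`; the function name and the inclusive window bounds are assumptions, and the one-, two-, and four-week periods follow the example in the text.

```python
from datetime import datetime, timedelta


def period_sample_sets(samples, now, periods_in_weeks=(1, 2, 4)):
    """Group samples into one period sample set per training period.

    A sample belongs to a period's set when its IO time lies within the last
    `weeks` weeks before `now`, so shorter windows are subsets of longer ones.
    """
    sets = {}
    for weeks in periods_in_weeks:
        start = now - timedelta(weeks=weeks)
        sets[weeks] = [s for s in samples if start <= s[0] <= now]
    return sets
```

Each resulting set could then be used to train the model for its own prediction period, in line with the independence of periods noted earlier.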
In addition, in an embodiment, the training sample SMART information includes at least one of the following:
Accumulated start-stop cycles;
Accumulated load-unload cycles;
Number of grown defects;
Non-medium error count;
Number of uncorrected errors.
It should be noted that, for the selection of the training sample SMART information, reference may be made to the selection principle for SMART information in the disk failure prediction method above, which is not repeated here for brevity.
In addition, referring to FIG. 12, in an embodiment, step S720 in the embodiment shown in FIG. 7 further includes, but is not limited to, the following steps:
Step S1210: split the prediction training sample set into a training sample set and a test sample set according to a preset ratio;
Step S1220: train the prediction model according to the training sample set, and run a validation test on the trained prediction model according to the test sample set.
In an embodiment, the preset ratio may be any value and can be adjusted according to actual requirements; for example, the training sample set and the test sample set may be split at a ratio of 8:2.
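The 8:2 split of step S1210 can be sketched as below. Shuffling before the split and the fixed seed are assumptions of this example; the text only specifies the ratio.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Split samples into (training set, test set) at the given ratio."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must lie strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

For time-ordered disk data, a chronological split (older samples for training, newer for testing) may be preferable to a random shuffle; which is appropriate depends on how the model is evaluated.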
It should be noted that the feature expansion operation in the above embodiment may be performed either before or after the prediction training sample set is split, which is not limited in this embodiment.
It should be noted that when the prediction model is validated with the test sample set, common test metrics with configured thresholds can be used for the judgment, for example the False Discovery Rate (FDR) and the False Accept Rate (FAR). The specific threshold criteria can be adjusted according to actual requirements and are not limited here.
In addition, referring to FIG. 13, an embodiment of the present application further provides an electronic device. The electronic device 1300 includes a memory 1310, a processor 1320, and a computer program stored in the memory 1310 and executable on the processor 1320.
The processor 1320 and the memory 1310 may be connected by a bus or in another manner.
The non-transitory software programs and instructions required to implement the disk failure prediction method of the above embodiments are stored in the memory 1310. When executed by the processor 1320, they perform the disk failure prediction method applied to the electronic device 1300 in the above embodiments, for example method steps S110 to S120 in FIG. 1, method steps S310 to S330 in FIG. 3, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method steps S810 to S820 in FIG. 8, method steps S910 to S940 in FIG. 9, method steps S1110 to S1120 in FIG. 11, and method steps S1210 to S1220 in FIG. 12, as described above.
The embodiments of the present application include: obtaining a prediction data set of a disk to be predicted, the prediction data set including IO information of prediction sample IOs and SMART information corresponding to the prediction sample IOs, where the prediction data set is collected in the cache disk acceleration scenario of the disk to be predicted; and inputting the prediction data set into a pre-trained prediction model to obtain a prediction result for the disk to be predicted. The solution provided by the embodiments of the present application can combine IO information and SMART information to perform disk failure prediction for all types of disks, effectively reducing the risk of data loss.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions. When executed by a processor or controller, for example by a processor in the above electronic device embodiment, the computer-executable instructions cause the processor to perform the disk failure prediction method applied to the electronic device in the above embodiments, for example method steps S110 to S120 in FIG. 1, method steps S310 to S330 in FIG. 3, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method steps S810 to S820 in FIG. 8, method steps S910 to S940 in FIG. 9, method steps S1110 to S1120 in FIG. 11, and method steps S1210 to S1220 in FIG. 12, as described above.
Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, or appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
以上是对本申请的一些实施进行了具体说明,但本申请并不局限于上述实施方式,熟悉本领域的技术人员在不违背本申请范围的前提下还可作出种种的等同变形或替换,这些等同的变形或替换均包含在本申请权利要求所限定的范围内。The above is a specific description of some implementations of the present application, but the present application is not limited to the above-mentioned embodiments. Those skilled in the art can also make various equivalent modifications or replacements without departing from the scope of the present application. These equivalents Variations or substitutions of the above are all included within the scope defined by the claims of the present application.

Claims (15)

  1. A disk failure prediction method, comprising:
    acquiring a prediction data set of a disk to be predicted, the prediction data set comprising IO information of a prediction sample IO and SMART information corresponding to the prediction sample IO, wherein the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted;
    inputting the prediction data set into a pre-trained prediction model to obtain a prediction result for the disk to be predicted.
  2. The method according to claim 1, wherein the IO information further comprises IO time information, and inputting the prediction data set into the pre-trained prediction model to obtain the prediction result for the disk to be predicted comprises:
    determining a prediction period, and determining a period data set from the prediction data set according to the prediction period and the IO time information;
    obtaining, according to the period data set and the prediction model, a period failure probability of the disk to be predicted in the prediction period;
    determining the prediction result for the disk to be predicted according to the period failure probability.
  3. The method according to claim 2, wherein determining the prediction result for the disk to be predicted according to the period failure probability comprises:
    when the period failure probability is greater than a preset probability threshold corresponding to the prediction period, determining the prediction result to be high risk;
    when the period failure probability is less than or equal to the probability threshold, determining the prediction result to be low risk.
  4. The method according to claim 3, wherein, when the period failure probability is greater than the probability threshold, determining the prediction result to be high risk comprises:
    when the period failure probability is greater than the probability threshold, determining the prediction period corresponding to the period failure probability to be a high-risk period;
    when the number of times a prediction period has been determined to be a high-risk period is greater than a preset alarm count threshold, determining the prediction result to be high risk.
  5. The method according to claim 1, wherein the IO information further comprises an IO duration and an IO size, and the cache disk acceleration scenario of the disk to be predicted is determined by the following steps:
    acquiring the IOPS of the disk to be predicted, and determining a duration threshold according to the IOPS of the disk to be predicted and the IO size;
    when the IO duration is greater than the duration threshold, determining that the disk to be predicted is in the cache disk acceleration scenario.
  6. The method according to claim 1, wherein the SMART information comprises at least one of the following:
    an accumulated start-stop count;
    an accumulated load-unload count;
    a grown bad sector count;
    a non-medium error count;
    an uncorrectable error count.
  7. A prediction model training method, comprising:
    acquiring a prediction training sample set of a training sample disk, the prediction training sample set comprising training sample IO information of a training sample IO and training sample SMART information corresponding to the training sample IO, wherein the prediction training sample set is collected in a cache disk acceleration scenario of the training sample disk;
    training the prediction model according to the prediction training sample set.
  8. The method according to claim 7, wherein the training sample IO information comprises a training sample IO duration and a training sample IO size, and the cache disk acceleration scenario of the training sample disk is determined by the following steps:
    acquiring the IOPS of the training sample disk, and determining a training sample duration threshold according to the IOPS of the training sample disk and the training sample IO size;
    when the training sample IO duration is greater than the training sample duration threshold, determining that the training sample disk is in the cache disk acceleration scenario.
  9. The method according to claim 8, wherein acquiring the prediction training sample set of the training sample disk comprises:
    determining all IOs of the training sample disk in the cache disk acceleration scenario to be candidate IOs;
    determining training sample IOs from the candidate IOs according to the preset condition, and determining the IO information of the training sample IOs to be the training sample IO information;
    acquiring, from the SMART information of the training sample disk, the training sample SMART information corresponding to the training sample IOs;
    preprocessing the training sample IO information and the training sample SMART information, and generating the prediction training sample set from the preprocessed training sample IO information and training sample SMART information.
  10. The method according to claim 9, wherein the training sample IO information further comprises status information, latency information and IO time information, and the preset condition comprises at least one of the following:
    the status information is an error status indicating an IO error;
    the IO size is greater than a preset IO size threshold;
    the number of IOs currently determined to be training sample IOs is less than a preset number threshold;
    the IO time information falls within a preset sample collection period;
    the latency information falls within a preset latency distribution interval.
  11. The method according to claim 10, wherein training the prediction model according to the prediction training sample set further comprises:
    acquiring a preset training period, and determining a period sample set corresponding to the training period according to the IO time information of the training sample IOs;
    training the prediction model according to the period sample set.
  12. The method according to any one of claims 7 to 9, wherein the training sample SMART information comprises at least one of the following:
    an accumulated start-stop count;
    an accumulated load-unload count;
    a grown bad sector count;
    a non-medium error count;
    an uncorrectable error count.
  13. The method according to claim 8, wherein training the prediction model according to the prediction training sample set further comprises:
    splitting the prediction training sample set into a training sample set and a test sample set according to a preset ratio;
    training the prediction model according to the training sample set, and performing a verification test on the trained prediction model according to the test sample set.
  14. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the disk failure prediction method according to any one of claims 1 to 6, or performs the prediction model training method according to any one of claims 7 to 13.
  15. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to perform the disk failure prediction method according to any one of claims 1 to 6, or to perform the prediction model training method according to any one of claims 7 to 13.
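
As an illustrative aid only, and not part of the claims, the decision logic recited in claims 2 to 5 can be sketched in Python as follows. The claims do not give a formula for the duration threshold, so the linear scaling in `duration_threshold` (including the 4096-byte reference size) is an assumption. All names (`IoRecord`, `in_cache_acceleration`, `select_period`, `overall_risk`) are hypothetical, and the sketch assumes that the per-period failure probabilities have already been produced by a pre-trained model. For simplicity it uses a single probability threshold, whereas claim 3 allows a threshold per prediction period.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class IoRecord:
    """One sampled IO (hypothetical layout; field names are assumptions)."""
    timestamp: float   # IO time information, in seconds
    duration_s: float  # IO duration, in seconds
    size_bytes: int    # IO size, in bytes


def duration_threshold(iops: float, io_size_bytes: int,
                       ref_size_bytes: int = 4096) -> float:
    """Assumed formula: nominal service time 1/IOPS, scaled up linearly
    for IOs larger than a reference size. The claims only state that the
    threshold is determined from the IOPS and the IO size."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return (1.0 / iops) * max(1.0, io_size_bytes / ref_size_bytes)


def in_cache_acceleration(io: IoRecord, iops: float) -> bool:
    """Claim 5: the scenario applies when the IO duration exceeds the threshold."""
    return io.duration_s > duration_threshold(iops, io.size_bytes)


def select_period(records: Sequence[IoRecord], start: float,
                  length: float) -> List[IoRecord]:
    """Claim 2: keep the records whose IO time falls in [start, start + length)."""
    return [r for r in records if start <= r.timestamp < start + length]


def overall_risk(period_probs: Sequence[float], prob_threshold: float,
                 alarm_threshold: int) -> str:
    """Claims 3 and 4: a period is high-risk when its failure probability
    exceeds the probability threshold; the disk is high risk when the number
    of high-risk periods exceeds the alarm count threshold."""
    high_risk_periods = sum(1 for p in period_probs if p > prob_threshold)
    return "high risk" if high_risk_periods > alarm_threshold else "low risk"
```

For example, with a probability threshold of 0.7 and an alarm count threshold of 2, the period probabilities `[0.9, 0.2, 0.8, 0.95]` contain three high-risk periods, so `overall_risk` returns `"high risk"`; with only `[0.9, 0.2]` it returns `"low risk"`.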
PCT/CN2021/133728 2020-12-03 2021-11-26 Magnetic disk failure prediction method, prediction model training method, and electronic device WO2022116922A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011394121.0A CN114595085A (en) 2020-12-03 2020-12-03 Disk failure prediction method, prediction model training method and electronic equipment
CN202011394121.0 2020-12-03

Publications (1)

Publication Number Publication Date
WO2022116922A1 true WO2022116922A1 (en) 2022-06-09

Family

ID=81813354

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/133728 WO2022116922A1 (en) 2020-12-03 2021-11-26 Magnetic disk failure prediction method, prediction model training method, and electronic device

Country Status (2)

Country Link
CN (1) CN114595085A (en)
WO (1) WO2022116922A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115145494A (en) * 2022-08-11 2022-10-04 江苏臻云技术有限公司 Disk capacity prediction system and method based on big data time series analysis
CN116259337A (en) * 2023-05-15 2023-06-13 合肥联宝信息技术有限公司 Disk abnormality detection method, model training method and related device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822657B (en) * 2023-08-25 2024-01-09 之江实验室 Method and device for accelerating model training, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170344267A1 (en) * 2016-05-27 2017-11-30 Netapp, Inc. Methods for proactive prediction of disk failure in the disk maintenance pipeline and devices thereof
CN109376905A (en) * 2018-09-20 2019-02-22 广东亿迅科技有限公司 Disk space prediction technique, device, computer equipment and storage medium
CN109828869A (en) * 2018-12-05 2019-05-31 中兴通讯股份有限公司 Predict the method, apparatus and storage medium of hard disk failure time of origin
CN110389866A (en) * 2018-04-20 2019-10-29 武汉安天信息技术有限责任公司 Disk failure prediction technique, device, computer equipment and computer storage medium
CN111581072A (en) * 2020-05-12 2020-08-25 国网安徽省电力有限公司信息通信分公司 Disk failure prediction method based on SMART and performance log
CN112433896A (en) * 2020-11-05 2021-03-02 北京浪潮数据技术有限公司 Server disk failure prediction method, device, equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115145494A (en) * 2022-08-11 2022-10-04 江苏臻云技术有限公司 Disk capacity prediction system and method based on big data time series analysis
CN115145494B (en) * 2022-08-11 2023-09-15 江苏臻云技术有限公司 Disk capacity prediction system and method based on big data time sequence analysis
CN116259337A (en) * 2023-05-15 2023-06-13 合肥联宝信息技术有限公司 Disk abnormality detection method, model training method and related device
CN116259337B (en) * 2023-05-15 2023-09-05 合肥联宝信息技术有限公司 Disk abnormality detection method, model training method and related device

Also Published As

Publication number Publication date
CN114595085A (en) 2022-06-07

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21899954; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN EP: public notification in the EP bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.10.2023))