WO2022116922A1

WO2022116922A1 - Magnetic disk failure prediction method, prediction model training method, and electronic device

Info

Publication number: WO2022116922A1
Application number: PCT/CN2021/133728
Authority: WO
Inventors: 宋顺
Original assignee: 中兴通讯股份有限公司
Priority date: 2020-12-03
Filing date: 2021-11-26
Publication date: 2022-06-09
Also published as: CN114595085A

Abstract

A magnetic disk failure prediction method, a prediction model training method, and an electronic device. The magnetic disk failure prediction method comprises: acquiring a prediction data set of a magnetic disk to be predicted, wherein the prediction data set comprises IO information of a prediction sample IO and SMART information corresponding to the prediction sample IO, and the prediction data set is collected from a cache disk acceleration scenario of said magnetic disk (S110); and inputting the prediction data set into a pre-trained prediction model, so as to obtain a prediction result of said magnetic disk (S120).

Description

Disk failure prediction method, prediction model training method, and electronic device

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based on the Chinese patent application with the application number of 202011394121.0 and the filing date of December 3, 2020, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is incorporated herein by reference.

TECHNICAL FIELD

The present application relates to, but is not limited to, the field of data storage, and in particular, relates to a method for predicting disk failure, a method for training a prediction model, and an electronic device.

BACKGROUND

With the development of network technology and communication technology, the amount of data storage in server data centers increases rapidly. Disks are important hardware devices for data storage. For larger data centers, there are usually more disks. Disks typically have a limited lifespan, and at the end of their useful life, the chance of disk damage increases dramatically. In order to solve this problem, the replication technology or erasure coding technology is usually used for data redundancy, but it can only avoid data loss caused by the failure of a single disk. When multiple disks fail at the same time, there is still a risk of data loss.

Based on this, it is usually necessary to predict disk failure while the disk is in operation, and to replace the disk in time when a high failure risk is detected, thereby reducing the risk of data loss. A common practice is to use a trained prediction model for failure prediction. However, the training data used by existing prediction models is usually the Self-Monitoring, Analysis and Reporting Technology (SMART) information of the disk, and SMART information is only suitable for Serial Advanced Technology Attachment (SATA) mechanical disks, which expose many types of disk parameters. For Small Computer System Interface (SCSI) disks, which expose fewer types of parameters, accurate predictions cannot be made.

SUMMARY OF THE INVENTION

The following is an overview of the topics detailed in this article. This summary is not intended to limit the scope of protection of the claims.

Embodiments of the present application provide a disk failure prediction method, a prediction model training method, and an electronic device.

In a first aspect, an embodiment of the present application provides a method for predicting disk failure, including: acquiring a prediction data set of a disk to be predicted, where the prediction data set includes IO information of a prediction sample input/output (IO) and SMART information corresponding to the prediction sample IO, wherein the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted; and inputting the prediction data set into a pre-trained prediction model to obtain a prediction result of the disk to be predicted.

In a second aspect, an embodiment of the present application further provides a method for training a prediction model, including: acquiring a prediction training sample set of a training sample disk, where the prediction training sample set includes training sample IO information of a training sample IO and training sample SMART information corresponding to the training sample IO, wherein the prediction training sample set is collected in a cache disk acceleration scenario of the training sample disk; and training the prediction model according to the prediction training sample set.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the disk failure prediction method described in the first aspect or the prediction model training method described in the second aspect.

In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used to execute the disk failure prediction method described in the first aspect, or to execute the prediction model training method described in the second aspect.

Other features and advantages of the present application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the description, claims and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are used to provide a further understanding of the technical solutions of the present application, and constitute a part of the specification. They are used to explain the technical solutions of the present application together with the embodiments of the present application, and do not constitute a limitation on the technical solutions of the present application.

FIG. 1 is a flowchart of a method for predicting disk failure provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a module framework provided by another embodiment of the present application;

FIG. 3 is a flowchart of determining a prediction result according to a prediction period provided by another embodiment of the present application;

FIG. 4 is a flowchart of determining a prediction result according to a periodic failure probability provided by another embodiment of the present application;

FIG. 5 is a flowchart of determining a prediction result according to the number of times a period is determined to be a high-risk period provided by another embodiment of the present application;

FIG. 6 is a flowchart of determining that a disk to be predicted is in a cache disk acceleration scenario provided by another embodiment of the present application;

FIG. 7 is a flowchart of a prediction model training method provided by another embodiment of the present application;

FIG. 8 is a flowchart of determining that a training sample disk is in a cache disk acceleration scenario provided by another embodiment of the present application;

FIG. 9 is a flowchart of obtaining a prediction training sample set provided by another embodiment of the present application;

FIG. 10 is a flowchart of determining training sample IO information according to preset conditions provided by another embodiment of the present application;

FIG. 11 is a flowchart of training a prediction model according to a training period provided by another embodiment of the present application;

FIG. 12 is a flowchart of dividing the prediction training sample set into a training sample set and a test sample set provided by another embodiment of the present application;

FIG. 13 is a schematic structural diagram of an electronic device provided by another embodiment of the present application.

DETAILED DESCRIPTION

In order to make the purpose, technical solutions and advantages of the present application more clearly understood, the present application will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application.

It should be noted that although functional modules are divided in the schematic diagram of the device and a logical sequence is shown in the flowcharts, in some cases the modules may be divided differently from the device diagram, or the steps shown or described may be executed in an order different from that in the flowcharts. The terms "first", "second" and the like in the specification, claims or the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

The present application provides a method for predicting disk failure, a method for training a prediction model, and an electronic device. The method for predicting disk failure includes: acquiring a prediction data set of a disk to be predicted, where the prediction data set includes IO information of a prediction sample IO and SMART information corresponding to the prediction sample IO, wherein the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted; and inputting the prediction data set into a pre-trained prediction model to obtain a prediction result of the disk to be predicted. According to the solution provided by the embodiments of the present application, disk failure prediction can be performed for all types of disks by combining IO information and SMART information, which effectively reduces the risk of data loss.

The embodiments of the present application will be further described below with reference to the accompanying drawings.

As shown in FIG. 1 , FIG. 1 is a flowchart of a method for predicting a disk failure provided by an embodiment of the present application. The method for predicting a disk failure includes, but is not limited to, steps S110 and S120.

Step S110, obtaining the prediction data set of the disk to be predicted, the prediction data set including the IO information of the predicted sample IO and the SMART information corresponding to the predicted sample IO, wherein the predicted data set is collected from the cache disk acceleration scene of the to-be-predicted disk.

It should be noted that the IO information of each predicted sample IO of the disk to be predicted includes multiple attributes such as IO delay, IO size, and IO status information. Therefore, using IO information as the input of the prediction model can effectively alleviate the problem of insufficient attributes.

It should be noted that, in this embodiment, the minimum allowable time for the prediction sample IO can be determined from the number of read and write operations per second (Input/Output Operations Per Second, IOPS) of the disk, and this minimum allowable time is used as the duration threshold. When the total duration of the prediction sample IO is greater than the duration threshold, it can be determined that the prediction sample IO is a large-block IO, that is, the disk working scenario corresponding to the IO is the cache disk acceleration scenario. It is worth noting that in the cache disk acceleration scenario, the IO issued to the disk tends to be large-block reads and writes, and small-block IO is rare. At the same time, the block layer usually has a maximum IO size of 512K, so the range of IO sizes is relatively small. This provides a basis for tracking the IO size of the disk. Those skilled in the art can understand that, in a storage system, in order to ensure the quality of storage services, front-end applications usually perform service capability matching, and the storage side sets front-end quality of service (Quality of Service, QoS), back-end QoS, and so on. QoS effectively prevents IO bursts in most cases, and avoids the situation in which the IO queue depth is too large and the disk is too heavily loaded to provide stable service. To sum up, in the cache disk acceleration scenario, collecting statistics on the delay of large-block IOs under a given load is clearly meaningful, and provides a rich set of disk status indicators from the application level. Therefore, the IO information in the cache disk acceleration scenario can be used for failure prediction.

It is worth noting that the prediction data set in this embodiment includes both IO information and the corresponding SMART information. The corresponding SMART information can be the SMART information obtained while the prediction sample IO is being executed, or it can be SMART information that is collected periodically, for example once a day. The specific collection method and period can be adjusted according to the actual situation, as long as the SMART information and the IO information have a certain correlation.
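One possible way to associate periodically collected SMART information with each prediction sample IO is sketched below. It is only an illustrative sketch, not the method fixed by this embodiment: the record types `IORecord` and `SmartSnapshot`, their field names, and the rule "use the latest snapshot collected at or before the IO time" are all assumptions introduced for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IORecord:
    """Hypothetical IO information of one prediction sample IO."""
    time: float        # IO time information, e.g. seconds since epoch
    latency_ms: float  # IO delay
    size_kb: int       # IO size


@dataclass
class SmartSnapshot:
    """Hypothetical periodically collected SMART information."""
    time: float                 # collection time
    attributes: Dict[str, int]  # SMART attribute name -> raw value


def attach_smart(ios: List[IORecord], snapshots: List[SmartSnapshot]) -> List[dict]:
    """Pair each IO with the latest SMART snapshot collected at or before it."""
    ordered = sorted(snapshots, key=lambda s: s.time)
    times = [s.time for s in ordered]
    rows = []
    for io in ios:
        idx = bisect_right(times, io.time) - 1
        if idx < 0:
            continue  # no SMART snapshot exists yet for this IO; skip it
        rows.append({"io": io, "smart": ordered[idx]})
    return rows
```

With daily SMART collection, every IO of a given day would be paired with that day's most recent snapshot, which gives the IO information and the SMART information the correlation described above.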

It should be noted that, because different types of disks have different physical properties, their failure criteria are different. When training the prediction model, the disk type information can be used as one of the selected features, so that the prediction result obtained by the prediction model represents the failure risk of that type of disk. Likewise, the disk type information can be obtained as an input of the prediction model, so that the prediction model can predict the failure of different types of disks. It can be understood that the disk type information may include the disk manufacturer, disk model, disk capacity, disk serial number, and rotation speed, which is not limited in this embodiment.

In step S120, the prediction data set is input into the pre-trained prediction model, and the prediction result of the disk to be predicted is obtained.

In one embodiment, the specific prediction timing can be set according to operational requirements, for example once a day, to reduce the risk of data loss. Alternatively, a prediction can be made each time IO information and SMART information are collected. This can be adjusted according to actual needs.

It is worth noting that the IO path is affected not only by the disk itself, but also by the controller, expansion cards, cables, and even the operating system. IO information is therefore the final, combined result of all of these factors. It needs to be combined with information about the disk itself to make a comprehensive judgment, and cannot be used for prediction alone; otherwise, false alarms are likely. SMART information represents the state of the disk parameters, so combining the features of the IO information with the features of the SMART information allows the failure risk of the disk to be predicted more accurately.

It can be understood that, before the prediction model is used to obtain prediction results, the prediction model can be trained in advance according to IO information and SMART information. For example, the delay information in the IO information, the rate of change of each parameter in the SMART information, and the absolute value of the increment of each parameter in the SMART information can be used as training features. The features are labeled and then input to the prediction model for training, so that the prediction model can obtain the prediction result of the disk to be predicted according to the above features.

In one embodiment, the prediction result can take any form, such as the current failure risk probability of the disk to be predicted, or a specific risk value. The prediction result can also be determined according to a specific collection period. For example, if the prediction data set contains the data of one week, the prediction result is the risk probability of failure in the next week. Any form that reflects the failure risk of the disk to be predicted is acceptable, which is not limited here.

In addition, in an embodiment, a system architecture for applying the disk failure prediction method of the present application may be as shown in FIG. 2. It includes a prediction center and several proxy nodes, and both the prediction center and the proxy nodes may take the form of electronic devices or servers, which is not limited here. The prediction center may include an alarm management module, a prediction module, and a disk information management module. The alarm management module is configured to issue an alarm prompt when it is detected that a disk is at high risk. The prediction module is configured to perform failure risk prediction for the disk to be predicted according to the prediction data set. The disk information management module is configured to receive and manage the IO information and SMART information sent by the proxy nodes, and to form the prediction data set. Each proxy node includes an IO module and a SMART module. The IO module is configured to obtain the IO information of the disks of the proxy node and to filter the IO information according to preset rules, so that the filtered IO information can be used to form the prediction data set. The SMART module is configured to collect the SMART information of the disks of the proxy node. It should be noted that this application does not involve specific structural improvements to the proxy node or the prediction center, but only the processing of the collected data, so the structure is not described further here.

In addition, referring to FIG. 3 , in one embodiment, the IO information further includes IO time information, and step S120 in the embodiment shown in FIG. 1 also includes but is not limited to the following steps:

Step S310, determine the prediction period, and determine the period data set from the prediction data set according to the prediction period and the IO time information;

Step S320, according to the periodic data set and the prediction model, obtain the periodic failure probability of the disk to be predicted in the prediction period;

Step S330, determining the prediction result of the disk to be predicted according to the periodic failure probability.

In one embodiment, the prediction period can be selected according to actual needs. For example, to predict the near-term failure risk of the disk to be predicted, the prediction period can be set to several days, one week, or two weeks; to predict a longer-term failure risk, the prediction period can be set to one month. This can be adjusted according to actual demand. It can be understood that there can be any number of prediction periods. For example, data for one week, two weeks, and four weeks can be acquired at the same time, and a prediction result obtained for each prediction period, so that the failure risk prediction of the disk to be predicted is more accurate.

In one embodiment, as shown in FIG. 2, the collected prediction data set is stored in the disk information management module. To reduce storage pressure, life cycle management can also be performed on the collected data. For example, if the prediction period is set to one week, life cycle management can ensure that the IO information and SMART information in the prediction data set held by the disk information management module are all data collected within the last week. The specific life cycle management method is not an improvement made in this embodiment, and is not described further here.

In one embodiment, the IO time information can be specific time information about the execution of the IO by the disk to be predicted, such as the time when execution of the IO starts or the time when execution of the IO completes. The specific choice can be adjusted according to the actual situation, and is not limited here.

In one embodiment, the periodic failure probability is the failure probability of the disk to be predicted over a span equal to the prediction period. For example, if the prediction period is one week, the period data set is the IO information and SMART information collected in the past week. Failure prediction is performed on this information, and the resulting periodic failure probability is the failure probability of the disk to be predicted in the next week.
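The selection of a period data set in step S310 can be illustrated with the following minimal sketch. It assumes that each record in the prediction data set is a `(io_time, payload)` tuple; the function names `select_period_data` and `split_by_periods` are hypothetical and do not come from the application.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Tuple

# Assumed record layout: (IO time information, IO and SMART payload).
Record = Tuple[datetime, Any]


def select_period_data(records: List[Record], now: datetime, period_days: int) -> List[Record]:
    """Keep only records whose IO time falls inside the last `period_days` days."""
    start = now - timedelta(days=period_days)
    return [r for r in records if start <= r[0] <= now]


def split_by_periods(records: List[Record], now: datetime,
                     periods_days: List[int]) -> Dict[int, List[Record]]:
    """Build one period data set per prediction period (e.g. 7, 14 and 28 days)."""
    return {days: select_period_data(records, now, days) for days in periods_days}
```

Each period data set would then be passed to the prediction model separately to obtain the periodic failure probability for that prediction period, as in step S320.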

In addition, referring to FIG. 4 , in an embodiment, step S330 in the embodiment shown in FIG. 3 further includes but is not limited to the following steps:

Step S410, when the cycle failure probability is greater than a preset probability threshold corresponding to the prediction cycle, determine the prediction result as a high risk;

Step S420, when the periodic failure probability is less than or equal to the probability threshold, the prediction result is determined to be low risk.

It should be noted that the probability threshold can be determined according to the actual risk management requirements. For example, with a probability threshold of 80%, a periodic failure probability greater than 80% indicates a high failure risk, and a periodic failure probability less than or equal to 80% indicates a low failure risk. This embodiment does not limit the probability threshold. Of course, several probability thresholds corresponding to several risk levels can also be set according to actual needs, which will not be described further here.
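Steps S410 and S420 amount to a single comparison, sketched below. The default value of 0.8 mirrors the 80% example above and is not a value mandated by this embodiment; the function name `classify_risk` is hypothetical.

```python
def classify_risk(periodic_failure_probability: float,
                  probability_threshold: float = 0.8) -> str:
    """Return 'high' when the probability exceeds the threshold, else 'low'.

    A probability exactly equal to the threshold is treated as low risk,
    matching the "less than or equal to" wording of step S420.
    """
    if periodic_failure_probability > probability_threshold:
        return "high"
    return "low"
```

Because each prediction period has its own threshold, a caller would pass the threshold configured for the period whose probability is being evaluated.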

It should be noted that using the risk level as the prediction result reflects the failure probability of the detected disk, so that disk replacement can be scheduled in advance when the prediction result is high risk. In particular, this reduces the likelihood of multiple disks failing at the same time, and thus the risk of data loss. At the same time, because a probability threshold is set for low risk, no alarm is generated when the periodic failure probability is lower than the probability threshold, which effectively reduces the false alarm rate.

It can be understood that the periodic failure probability is the prediction result corresponding to one prediction period, and does not affect the prediction results of other prediction periods. For example, the periodic failure probability for a prediction period of one week and the periodic failure probability for a prediction period of two weeks are independent parameters. That is, when determining the prediction result for a prediction period of two weeks, the periodic failure probability for a prediction period of one week is not considered, which will not be described further here.

In addition, referring to FIG. 5 , in an embodiment, step S410 in the embodiment shown in FIG. 4 further includes but is not limited to the following steps:

Step S510, when the periodic failure probability is greater than the probability threshold, determine that the prediction period corresponding to the periodic failure probability is a high-risk period;

Step S520, when the number of times that the prediction period is determined to be a high-risk period is greater than a preset alarm number threshold, the prediction result is determined to be high risk.

In one embodiment, using the alarm number threshold can effectively reduce the number of false alarms. In the actual prediction process, because periodic prediction is used, abnormal data on a single day may well cause a prediction result of high risk. Making multiple predictions therefore effectively reduces the deviation in prediction results caused by occasional anomalies. The specific alarm number threshold can be adjusted according to actual needs, which is not limited here.
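The counting logic of steps S510 and S520 can be sketched as follows. The sketch assumes that the periodic failure probabilities from repeated predictions of the same prediction period are available as a list; the function name and the default thresholds are illustrative assumptions only.

```python
from typing import Iterable


def result_with_alarm_count(periodic_failure_probabilities: Iterable[float],
                            probability_threshold: float = 0.8,
                            alarm_number_threshold: int = 2) -> str:
    """Report 'high' only if enough predictions marked the period as high-risk."""
    # Step S510: a prediction whose probability exceeds the threshold
    # marks the prediction period as a high-risk period once.
    high_risk_count = sum(
        1 for p in periodic_failure_probabilities if p > probability_threshold
    )
    # Step S520: the count must be strictly greater than the alarm number threshold.
    return "high" if high_risk_count > alarm_number_threshold else "low"
```

With these example defaults, a single anomalous day that pushes one prediction above the threshold does not by itself produce a high-risk result.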

In one embodiment, when it is determined that the prediction result is high risk, alarm information may also be generated according to basic information of the disk to be predicted. It can be understood that the alarm information can be generated by the alarm management module shown in FIG. 2. For example, the alarm information is pushed to the background management system and carries the basic information of the disk, so that maintenance personnel can maintain the disk promptly and accurately.

In addition, referring to FIG. 6, in one embodiment, the IO information also includes the IO duration and the IO size, and the cache disk acceleration scenario of the disk to be predicted is determined by the following steps:

Step S610, obtaining the IOPS of the disk to be predicted, and determining the duration threshold according to the IOPS and IO size of the disk to be predicted;

Step S620, when the IO duration is greater than the duration threshold, it is determined that the disk to be predicted is in a cache disk acceleration scenario.

It should be noted that the IOPS of the disk to be predicted can be obtained in any way, such as by reading the IOPS performance parameters of the disk to be predicted, or by actually performing several IO tests on the disk to be predicted. The specific method can be selected according to actual needs.

It can be understood that the duration threshold can be obtained by dividing the IO size by the IOPS. Since IOPS represents the read and write operation capability of the disk, the duration threshold represents the minimum allowable time required for the disk to process an IO of a specific size. When the IO duration is greater than the minimum allowable time, it can be determined that the IO is executed in the cache disk acceleration scenario.
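The calculation described in steps S610 and S620 can be sketched as below. The application does not state the units of the IO size, the IOPS, or the IO duration, so the sketch simply assumes that the caller supplies all three in mutually consistent units; the function names are hypothetical.

```python
def duration_threshold(io_size: float, iops: float) -> float:
    """Duration threshold as described above: IO size divided by IOPS.

    Units are an assumption of this sketch; the caller must keep the
    IO size, IOPS, and IO duration consistent with each other.
    """
    if iops <= 0:
        raise ValueError("IOPS must be positive")
    return io_size / iops


def in_cache_disk_acceleration_scenario(io_duration: float,
                                        io_size: float,
                                        iops: float) -> bool:
    """Step S620: the IO duration must exceed the duration threshold."""
    return io_duration > duration_threshold(io_size, iops)
```

The IOPS value passed in could come either from the disk's published performance parameters or from measured IO tests, as noted above.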

It is worth noting that it is possible to pass an IO

In addition, in one embodiment, the SMART information includes at least one of the following:

Accumulated start-stop cycles;

Accumulated load-unload cycles;

Number of grown bad sectors;

Non-medium error count;

Number of uncorrectable errors.

In one embodiment, the SMART information may include any available attributes, such as the disk health score (SMART Health Status), the accumulated start-stop cycles (Accumulated start-stop cycles), the accumulated load-unload cycles (Accumulated load-unload cycles), the number of grown bad sectors (Elements in grown defective list), the non-medium error count (Non-medium error count), and the number of uncorrectable errors, where the number of uncorrectable errors can include the total uncorrected read errors (Total uncorrected read errors) and the total uncorrected write errors (Total uncorrected write errors). Those skilled in the art may add or remove specific disk parameters according to actual needs, which is not limited here.

It can be understood that, based on the above disk parameters, the rate of change and the increment of each disk parameter can be used to characterize the failure risk of the disk. The increment may be taken as an absolute value, which characterizes the magnitude of variation of the disk parameter. The greater the variation of the disk parameters, the greater the risk of disk failure.
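As a rough illustration of how the rate of change and the absolute increment could be derived from two consecutive SMART collections, consider the following sketch. The attribute names, the feature-name suffixes, and the use of "change per day" as the rate are assumptions made for the example and are not specified by the application.

```python
from typing import Dict


def smart_change_features(previous: Dict[str, float],
                          current: Dict[str, float],
                          interval_days: float) -> Dict[str, float]:
    """Compute per-attribute absolute increment and rate of change.

    Only attributes present in both collections are used.
    """
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    features: Dict[str, float] = {}
    for name, new_value in current.items():
        if name not in previous:
            continue
        delta = new_value - previous[name]
        features[f"{name}_abs_increment"] = abs(delta)     # magnitude of variation
        features[f"{name}_rate"] = delta / interval_days   # change per day
    return features
```

For instance, if a hypothetical `grown_defects` attribute rises from 10 to 16 over two days, the sketch yields an absolute increment of 6 and a rate of 3 per day.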

In addition, referring to FIG. 7 , an embodiment of the present application further provides a prediction model training method, including but not limited to step S710 and step S720.

Step S710, obtain the prediction training sample set of the training sample disk, where the prediction training sample set includes the training sample IO information of several training sample IOs and the training sample SMART information corresponding to the training sample IOs, wherein the prediction training sample set is collected in the cache disk acceleration scenario of the training sample disk.

In one embodiment, the prediction training sample set can be obtained, by means of an IOPS performance model, from the cache disk acceleration scenario described in the embodiment shown in FIG. 1. The IOPS performance model can be obtained by manually testing the IOPS performance of different large-block IOs at different queue depths. It can be understood that the IOPS performance model characterizes the read and write capability of the disk. Therefore, for a given number of IOs, the estimated allowable time, that is, the above-mentioned duration threshold, can be calculated using the IOPS performance model. If the actual processing duration is greater than the duration threshold, the IO can be considered to come from a cache disk acceleration scenario and can be treated as a valid sample. It can be understood that the prediction training sample set can be collected periodically, for example once a day, and the specific period can be selected according to actual needs.

It is worth noting that, for the principle that the prediction training samples are collected in the cache disk acceleration scenario, reference may be made to the principle described in the embodiment of FIG. 2 , which will not be repeated here.

Step S720, train the prediction model according to the prediction training sample set.

In an embodiment, the prediction model may be trained once a day, or the training schedule may be adjusted according to actual needs, which is not limited herein. It can be understood that when the training sample set includes several sample subsets, training can be carried out on each sample subset for its corresponding period, so that the prediction model can perform failure prediction for different periods.

In one embodiment, the prediction model may use a common model framework, such as the LightGBM framework. It should be noted that before training the prediction model, the basic parameters of the model need to be set. For example, when the model framework used is the above-mentioned LightGBM framework, the framework parameters can be set according to the following table 1:

Parameter name	Value
Learning rate	0.35
Iteration rounds	110
Cross validation	5
Total sample number	5160
Terminal condition	10^-4

Table 1 Model framework parameter configuration table
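The sketch below shows one way the values in Table 1 might be carried into a LightGBM training call. The mapping is an assumption: `learning_rate` is a LightGBM parameter, and `num_boost_round` and `nfold` are arguments of `lightgbm.cv`, but the application does not say which LightGBM settings the table rows correspond to. The `"objective": "binary"` entry is also an assumption (failure versus no failure), and the terminal condition is kept as a plain value because its LightGBM equivalent is not stated.

```python
from typing import Dict, Tuple

# Values copied from Table 1; the key names are hypothetical.
TABLE_1 = {
    "learning_rate": 0.35,
    "iteration_rounds": 110,
    "cross_validation_folds": 5,
    "total_sample_number": 5160,
    "terminal_condition": 1e-4,  # not mapped to any LightGBM setting here
}


def to_lightgbm_args(table: Dict[str, float]) -> Tuple[dict, dict]:
    """Split Table 1 into a params dict and keyword arguments for lightgbm.cv."""
    params = {
        "objective": "binary",  # assumption: two classes, failed / healthy
        "learning_rate": table["learning_rate"],
    }
    cv_kwargs = {
        "num_boost_round": int(table["iteration_rounds"]),
        "nfold": int(table["cross_validation_folds"]),
    }
    return params, cv_kwargs
```

Under these assumptions, a caller with LightGBM installed and a prepared `lightgbm.Dataset` named `train_set` could run `lightgbm.cv(params, train_set, **cv_kwargs)`.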

In addition, referring to FIG. 8 , in one embodiment, the training sample IO information includes the training sample IO duration and the training sample IO size, and step S710 in the embodiment shown in FIG. 7 also includes but is not limited to the following steps:

Step S810, obtaining the IOPS of the training sample disk, and determining the training sample duration threshold according to the IOPS of the training sample disk and the IO size of the training sample;

Step S820, when the training sample IO duration is greater than the training sample duration threshold, it is determined that the training sample disk is in a cache disk acceleration scenario.

It should be noted that, for the principle of determining that the training sample disk is in the cache disk acceleration scenario, reference may be made to the description of the embodiment shown in FIG. 6 , which is not repeated here for the sake of simplicity.

In addition, referring to FIG. 9 , in one embodiment, step S810 in the embodiment shown in FIG. 8 further includes but is not limited to the following steps:

Step S910, determining that all IOs of the training sample disk in the cache disk acceleration scenario are candidate IOs;

Step S920, determining the training sample IO from the candidate IOs according to the preset condition, and determining the IO information of the training sample IO as the training sample IO information;

Step S930, obtaining the training sample SMART information corresponding to the training sample IO from the SMART information of the training sample disk;

Step S940, preprocess the training sample IO information and the training sample SMART information, and generate a prediction training sample set according to the preprocessed training sample IO information and the training sample SMART information.

It should be noted that although most of the IOs in the cache disk acceleration scenario are large IOs, not all of them can be used for model training. Therefore, the IOs in the cache disk acceleration scenario are first determined as candidate IOs, and the training sample IOs are then screened from the candidate IOs according to the preset conditions.

In one embodiment, after the training sample IO and the training sample SMART information are obtained, the preprocessing performed on the training samples may check the validity of the training samples and check whether the training samples meet the time requirement. Other operations may also be added or removed according to actual needs, such as handling the imbalance between positive and negative samples, which is not repeated here. It can be understood that the validity check is mainly used to ensure that the candidate IOs obtained are continuous, and to avoid using IOs whose acquisition was interrupted as training samples. For example, if the disk is powered off during the execution of a certain IO, that IO is a discontinuous IO whose IO information deviates considerably and cannot be used for training, so samples of this type can be removed by preprocessing. It can be understood that whether the training samples meet the time requirement can be determined according to the set training period. For example, if the longest training period is set to four weeks, training samples older than four weeks are removed to ensure the timeliness of the data.
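The two checks described above can be sketched as a simple filter. This is a minimal illustration only: the `IoSample` record, its `interrupted` flag, and the `preprocess` function are hypothetical names, and how an interrupted IO is actually detected is not specified in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IoSample:
    """Hypothetical record for one training sample IO."""
    completed_at: datetime  # IO time information
    latency_ms: float       # delay information
    interrupted: bool       # True if the IO was cut short, e.g. by a power-off


def preprocess(samples, now, max_age=timedelta(weeks=4)):
    """Drop interrupted IOs and IOs older than the longest training period."""
    cutoff = now - max_age
    return [
        s for s in samples
        if not s.interrupted and s.completed_at >= cutoff
    ]
```

Here `max_age` defaults to the four-week example from the text; positive and negative sample balancing would be a further step and is left out of the sketch.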

In one embodiment, after the training sample IO and the training sample SMART information are obtained, feature expansion can also be performed, which helps to increase the discretization of the data. For example, where the IO information includes delay information, status information, and IO time information, the following feature expansion is performed on the training sample IO. Several delay segments are preset, such as 0 to 32 ms, 32 ms to 64 ms, 64 ms to 128 ms, 128 ms to 512 ms, and >= 512 ms. According to the delay information in the training sample IO information, the delay segment in which each training sample IO falls is determined, and the proportion of IOs in each delay segment is calculated. The proportions are appropriately weighted so that higher delay segments carry more weight, highlighting the health threat posed by high delay. The IO error rate is also determined, that is, the proportion of IOs in each delay segment whose status information is an IO error relative to the overall number of IOs, together with the average of the top N delays in each delay segment in descending order, where N can be selected according to actual needs. It can be understood that most SMART information is statistical data, so the rate of change and the absolute value of each statistic in the SMART information can be obtained to achieve feature expansion, which is not repeated here.
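The delay-segment features described above could be computed roughly as follows. The segment boundaries come from the text; the weights in `SEGMENT_WEIGHTS`, the feature names, and the input shape (a list of `(latency_ms, is_error)` pairs) are assumptions made for illustration.

```python
from bisect import bisect_right

# Segments in ms: [0, 32), [32, 64), [64, 128), [128, 512), [512, inf)
SEGMENT_EDGES_MS = [32, 64, 128, 512]
# Hypothetical weights: higher-delay segments count for more.
SEGMENT_WEIGHTS = [1.0, 1.5, 2.0, 3.0, 4.0]


def expand_features(ios, top_n=3):
    """Build per-segment features from (latency_ms, is_error) pairs."""
    n_segments = len(SEGMENT_EDGES_MS) + 1
    buckets = [[] for _ in range(n_segments)]
    errors = [0] * n_segments
    for latency_ms, is_error in ios:
        idx = bisect_right(SEGMENT_EDGES_MS, latency_ms)  # segment index
        buckets[idx].append(latency_ms)
        if is_error:
            errors[idx] += 1

    total = len(ios)
    features = {}
    for i in range(n_segments):
        share = len(buckets[i]) / total if total else 0.0
        top = sorted(buckets[i], reverse=True)[:top_n]
        # Weighted proportion of IOs that fall in this delay segment.
        features[f"seg{i}_weighted_share"] = share * SEGMENT_WEIGHTS[i]
        # Error IOs in this segment relative to the overall IO count.
        features[f"seg{i}_error_rate"] = errors[i] / total if total else 0.0
        # Mean of the N largest delays in this segment.
        features[f"seg{i}_top_mean_ms"] = sum(top) / len(top) if top else 0.0
    return features
```

Using `bisect_right` places a delay that sits exactly on a boundary (for example 32 ms) in the higher segment, which is one possible reading of the overlapping interval endpoints in the text.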

In addition, referring to FIG. 10 , in one embodiment, the training sample IO information further includes state information, delay information and IO time information, and the preset condition includes at least one of the following:

Status information is an error status used to characterize IO errors;

The IO size is greater than the preset IO size threshold;

The number of training sample IOs determined so far is less than the preset number threshold;

The IO time information conforms to the preset sample collection cycle;

The delay information satisfies a preset delay distribution interval.

In one embodiment, the preset condition may be judged and data filtered by the IO module shown in FIG. 2 , for example, a statistics list may be determined according to the collection period. For the convenience of description, the preset conditions of this embodiment are illustrated below with reference to FIG. 10 :

After the IO module obtains a candidate IO, it first judges whether the status information of the candidate IO indicates an IO error. If so, the candidate IO is added directly to the statistics list; collecting IOs whose status information is an IO error and using them for training allows the prediction model to predict more accurately the probability that the disk will fail. If the status information of the candidate IO is normal, the IO module determines whether the IO size satisfies the IO size threshold. Based on the analysis of the above embodiments, the delay characteristics of large-block IOs can be used to predict disk failure, so this embodiment may count only IOs of a specific size, for example only IOs in the range of 128K to 512K. If the size of the candidate IO is greater than the IO size threshold, the candidate IO is added to the candidate list; otherwise, the number of IOs already collected is judged, to avoid collecting too few IOs. When the number of candidate IOs already collected exceeds the number threshold, enough training sample IOs have been collected; the candidate IOs in the statistics list and the candidate list can then be determined as training sample IOs, and the current candidate IO need not be collected. If the number threshold is not exceeded, the current candidate IO needs to be judged further, for example by using its IO time information to determine whether it falls within the sample collection period. If it does not, the collection time of the candidate IO does not meet the collection period; since candidate IOs are collected in chronological order, it can be determined that collection has expired, so the candidate IO can be cleaned up and collection stopped. If the collection period has not been exceeded, the candidate IO is a valid IO and is added to the statistics list, and the candidate IOs in the statistics list and the candidate list are determined as training sample IOs.

It can be understood that the preset delay distribution interval can be set according to actual needs, for example the delay distribution intervals in the above embodiment: 0 to 32 milliseconds, 32 milliseconds to 64 milliseconds, 64 milliseconds to 128 milliseconds, 128 milliseconds to 512 milliseconds, and >= 512 milliseconds. When the delay information of a training sample IO satisfies the above delay distribution interval, the IO can be further determined as an available training sample IO.
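One possible reading of the screening flow of FIG. 10 is sketched below. The `CandidateIo` record, the function name, and the treatment of "exceeds the number threshold" as "greater than or equal to" are assumptions; the delay distribution interval check is omitted because the example intervals cover every non-negative delay.

```python
from dataclasses import dataclass


@dataclass
class CandidateIo:
    """Hypothetical record for one candidate IO."""
    timestamp: float  # IO time information, seconds since the epoch
    size_bytes: int   # IO size
    is_error: bool    # status information indicates an IO error


def screen_candidates(candidates, size_threshold, count_threshold,
                      period_start, period_end):
    """Select training sample IOs from candidates given in time order."""
    statistics_list = []  # error IOs and valid in-period IOs
    candidate_list = []   # IOs larger than the size threshold
    for io in candidates:
        if io.is_error:
            # Error IOs are always kept for training.
            statistics_list.append(io)
        elif io.size_bytes > size_threshold:
            candidate_list.append(io)
        elif len(statistics_list) + len(candidate_list) >= count_threshold:
            # Enough samples collected; the current IO is not collected.
            break
        elif not (period_start <= io.timestamp <= period_end):
            # IOs arrive in chronological order, so collection has expired.
            break
        else:
            statistics_list.append(io)
    selected = statistics_list + candidate_list
    return sorted(selected, key=lambda io: io.timestamp)
```

The function returns the union of both lists in chronological order, matching the statement that the IOs in the statistics list and the candidate list are together determined as training sample IOs.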

In addition, referring to FIG. 11 , in an embodiment, step S720 in the embodiment shown in FIG. 7 further includes but is not limited to the following steps:

Step S1110, obtaining a preset training period, and determining a period sample set corresponding to the training period according to the IO time information of the training sample IO;

Step S1120, train the prediction model according to the periodic sample set.

In one embodiment, the training period can be selected according to actual needs. For example, counting back from the current time, the training samples of the preceding week, the preceding two weeks, and the preceding four weeks are obtained as period sample sets, so that the prediction model can perform disk failure prediction for different prediction periods.

It can be understood that, after the training period is determined, the prediction data set can be collected according to the same period when performing disk failure prediction, so as to obtain the prediction result in the corresponding period.
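Selecting the period sample sets by IO time information can be sketched as follows. The one-, two-, and four-week windows follow the example in the text; the dictionary keys, the function name, and the `(io_time, features, label)` sample layout are hypothetical.

```python
from datetime import datetime, timedelta

# Training periods from the example: the preceding 1, 2, and 4 weeks.
TRAINING_PERIODS = {
    "1w": timedelta(weeks=1),
    "2w": timedelta(weeks=2),
    "4w": timedelta(weeks=4),
}


def period_sample_sets(samples, now, periods=TRAINING_PERIODS):
    """Group (io_time, features, label) samples into one set per period."""
    return {
        name: [s for s in samples if now - span <= s[0] <= now]
        for name, span in periods.items()
    }
```

The same windows could be applied to the prediction data set at prediction time, so that each period's data is evaluated over a window of matching length.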

In addition, in one embodiment, the training sample SMART information includes at least one of the following:

Cumulative start-stop count;

Cumulative load-unload count;

Grown bad sector count;

Non-media error count;

Uncorrectable error count.

It should be noted that, for the selection of the SMART information of the training samples, reference may be made to the selection principle of the SMART information in the above-mentioned disk failure prediction method, which is not repeated here for the sake of simplicity.

In addition, referring to FIG. 12 , in an embodiment, step S720 in the embodiment shown in FIG. 7 further includes but is not limited to the following steps:

Step S1210, dividing a training sample set and a test sample set from the prediction training sample set according to a preset ratio;

Step S1220: Train the prediction model according to the training sample set, and perform a verification test on the trained prediction model according to the test sample set.

In one embodiment, the preset ratio may be any value, which can be adjusted according to actual requirements, for example, the training sample set and the test sample set are divided according to the ratio of 8:2.
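A minimal sketch of the 8:2 division is given below. Shuffling before the split and the fixed `seed` are assumptions added for reproducibility; the text only specifies the ratio.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumption: random split
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Calling `split_samples(samples)` with ten samples yields eight training samples and two test samples; a different preset ratio can be passed as `train_ratio`.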

It should be noted that, the feature expansion operation in the foregoing embodiment may be performed before or after segmentation of the prediction training sample set, which is not limited in this embodiment.

It should be noted that when verifying the prediction model with the test sample set, common test indicators can be used and thresholds can be set for judgment, such as the false discovery rate (FDR) and the false accept rate (FAR). The specific threshold setting standard can be adjusted according to actual needs, which is not limited here.
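The verification test could be sketched as follows. The application does not define the two indicators, so the sketch assumes the usual textbook definitions: false discovery rate as FP / (FP + TP) and false accept rate as FP / (FP + TN), with label 1 meaning a failed disk. The threshold values and function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means failure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def false_discovery_rate(y_true, y_pred):
    """Share of predicted failures that were actually healthy."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tp) if (fp + tp) else 0.0


def false_accept_rate(y_true, y_pred):
    """Share of healthy samples that were predicted as failures."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0


def passes_verification(y_true, y_pred, max_fdr=0.1, max_far=0.01):
    """Check both indicators against hypothetical thresholds."""
    return (false_discovery_rate(y_true, y_pred) <= max_fdr
            and false_accept_rate(y_true, y_pred) <= max_far)
```

If the application intends different definitions for these abbreviations, only the two rate functions would need to change; the threshold comparison stays the same.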

In addition, referring to FIG. 13 , an embodiment of the present application further provides an electronic device, the electronic device 1300 includes: a memory 1310 , a processor 1320 , and a computer program stored on the memory 1310 and executable on the processor 1320 .

The processor 1320 and the memory 1310 may be connected by a bus or otherwise.

The non-transitory software programs and instructions required to implement the disk failure prediction method of the above embodiments are stored in the memory 1310 and, when executed by the processor 1320, perform the disk failure prediction method applied to the electronic device 1300 in the above embodiments, for example, the above-described method steps S110 to S120 in FIG. 1, method steps S310 to S330 in FIG. 3, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method steps S810 to S820 in FIG. 8, method steps S910 to S940 in FIG. 9, method steps S1110 to S1120 in FIG. 11, and method steps S1210 to S1220 in FIG. 12.

The embodiment of the present application includes: acquiring a prediction data set of a disk to be predicted, the prediction data set including IO information of a prediction sample IO and SMART information corresponding to the prediction sample IO, wherein the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted; and inputting the prediction data set into a pre-trained prediction model to obtain a prediction result of the disk to be predicted. According to the solution provided by the embodiment of the present application, disk failure prediction can be performed for all types of disks by combining IO information and SMART information, which effectively reduces the risk of data loss.

The apparatus embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by a processor or controller, for example by a processor in the above electronic device embodiment, the processor can execute the disk failure prediction method applied to the electronic device in the above embodiments, for example, the above-described method steps S110 to S120 in FIG. 1, method steps S310 to S330 in FIG. 3, method steps S410 to S420 in FIG. 4, method steps S510 to S520 in FIG. 5, method steps S610 to S620 in FIG. 6, method steps S710 to S720 in FIG. 7, method steps S810 to S820 in FIG. 8, method steps S910 to S940 in FIG. 9, method steps S1110 to S1120 in FIG. 11, and method steps S1210 to S1220 in FIG. 12.

Those of ordinary skill in the art can understand that all or some of the steps and systems in the methods disclosed above can be implemented as software, firmware, hardware, and appropriate combinations thereof. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery media.

The above is a specific description of some implementations of the present application, but the present application is not limited to the above-mentioned embodiments. Those skilled in the art can also make various equivalent modifications or replacements without departing from the scope of the present application, and these equivalent modifications or replacements are all included within the scope defined by the claims of the present application.

Claims

A disk failure prediction method, comprising:

acquiring a prediction data set of a disk to be predicted, the prediction data set including IO information of a prediction sample IO and SMART information corresponding to the prediction sample IO, wherein the prediction data set is collected in a cache disk acceleration scenario of the disk to be predicted;

inputting the prediction data set into a pre-trained prediction model to obtain a prediction result of the disk to be predicted.
The method according to claim 1, wherein the IO information further includes IO time information, and the inputting the prediction data set into the pre-trained prediction model to obtain the prediction result of the disk to be predicted comprises:

determining a forecast period, and determining a period data set from the forecast data set according to the forecast period and the IO time information;

According to the cycle data set and the prediction model, obtain the cycle failure probability of the disk to be predicted in the prediction cycle;

The prediction result of the disk to be predicted is determined according to the periodic failure probability.
The method according to claim 2, wherein the determining the prediction result of the disk to be predicted according to the periodic failure probability comprises:

When the cycle failure probability is greater than a preset probability threshold corresponding to the prediction cycle, determining the prediction result as a high risk;

When the periodic failure probability is less than or equal to the probability threshold, the predicted result is determined to be low risk.
The method according to claim 3, wherein, when the periodic failure probability is greater than the probability threshold, determining the prediction result as a high risk comprises:

When the periodic failure probability is greater than the probability threshold, determine that the prediction period corresponding to the periodic failure probability is a high-risk period;

When the number of times that the prediction period is determined to be a high-risk period is greater than a preset alarm number threshold, the prediction result is determined to be high risk.
The method according to claim 1, wherein the IO information further includes IO duration and IO size, and the cache disk acceleration scenario of the to-be-predicted disk is determined by the following steps:

Obtain the IOPS of the disk to be predicted, and determine the duration threshold according to the IOPS of the disk to be predicted and the IO size;

When the IO duration is greater than the duration threshold, it is determined that the disk to be predicted is in a cache disk acceleration scenario.
The method of claim 1, wherein the SMART information includes at least one of the following:

Cumulative start-stop count;

Cumulative load-unload count;

Grown bad sector count;

Non-media error count;

Uncorrectable error count.
A predictive model training method, comprising:

Obtain the prediction training sample set of the training sample disk, the prediction training sample set including the training sample IO information of the training sample IO and the training sample SMART information corresponding to the training sample IO, wherein the prediction training sample set is collected in a cache disk acceleration scenario of the training sample disk;

The prediction model is trained according to the prediction training sample set.
The method according to claim 7, wherein the training sample IO information includes the training sample IO duration and the training sample IO size, and the cache disk acceleration scene of the training sample disk is determined by the following steps:

Obtain the IOPS of the training sample disk, and determine the training sample duration threshold according to the IOPS of the training sample disk and the IO size of the training sample;

When the IO duration of the training sample is greater than the training sample duration threshold, it is determined that the training sample disk is in a cache disk acceleration scenario.
The method according to claim 8, wherein the obtaining the prediction training sample set of the training sample disk comprises:

Determine that all IOs of the training sample disk in the cache disk acceleration scenario are candidate IOs;

Determine the training sample IO from the candidate IOs according to the preset condition, and determine the IO information of the training sample IO as the training sample IO information;

Obtain the training sample SMART information corresponding to the training sample IO from the SMART information of the training sample disk;

The training sample IO information and the training sample SMART information are preprocessed, and a prediction training sample set is generated according to the preprocessed training sample IO information and the training sample SMART information.
The method according to claim 9, wherein the training sample IO information further includes state information, delay information and IO time information, and the preset condition includes at least one of the following:

The state information is an error state used to characterize IO errors;

The IO size is greater than the preset IO size threshold;

The number of training sample IOs determined so far is less than the preset number threshold;

The IO time information conforms to the preset sample collection period;

The delay information satisfies a preset delay distribution interval.
The method according to claim 10, wherein the training the prediction model according to the prediction training sample set further comprises:

Obtain a preset training period, and determine a period sample set corresponding to the training period according to the IO time information of the training sample IO;

The prediction model is trained based on the periodic sample set.
The method according to any one of claims 7 to 9, wherein the training sample SMART information includes at least one of the following:

Cumulative start-stop count;

Cumulative load-unload count;

Grown bad sector count;

Non-media error count;

Uncorrectable error count.
The method according to claim 8, wherein the training the prediction model according to the prediction training sample set further comprises:

Splitting the training sample set and the test sample set from the prediction training sample set according to a preset ratio;

The prediction model is trained according to the training sample set, and a verification test is performed on the trained prediction model according to the test sample set.
An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the disk failure prediction method according to any one of claims 1 to 6, or the prediction model training method according to any one of claims 7 to 13.
A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute the disk failure prediction method according to any one of claims 1 to 6, or to execute the prediction model training method according to any one of claims 7 to 13.