CN114943321A

CN114943321A - Fault prediction method, device and equipment for hard disk

Info

Publication number: CN114943321A
Application number: CN202110172914.6A
Authority: CN
Inventors: 刘冬实; 康炳南; 纪晓峰; 胡崝
Original assignee: XFusion Digital Technologies Co Ltd
Current assignee: XFusion Digital Technologies Co Ltd
Priority date: 2021-02-08
Filing date: 2021-02-08
Publication date: 2022-08-26
Also published as: WO2022166481A1

Abstract

A fault location device first obtains a first attribute value of a hard disk, inputs the first attribute value into a fault prediction model, and obtains a first anomaly score, which indicates the running state of the hard disk under the first attribute value. The fault prediction model is trained based on second attribute values collected during normal operation of the hard disk; a second attribute value indicates the operating state of a plurality of components during normal operation of the hard disk. The fault location device compares the first anomaly score with a threshold; when the first anomaly score is greater than the threshold, the hard disk can still run but has a fault risk, and a fault may occur at some future time. Training the fault prediction model on attribute values from normal operation improves its accuracy. The output value of the fault prediction model therefore reflects the running state of the hard disk more accurately, so that whether the hard disk has a fault risk can be determined accurately.

Description

Fault prediction method, device and equipment for hard disk

Technical Field

The present application relates to the field of storage technologies, and in particular, to a method, an apparatus, and a device for predicting a failure of a hard disk.

Background

At present, a Solid State Drive (SSD) is usually adopted in a storage system as a main data storage device. The reliability of the storage system is affected by the failure rate of the data storage device.

For this reason, it is necessary to timely troubleshoot possible faults in the data storage device, locate the cause of the faults in the data storage device, and correct and process the faults in the data storage device.

Therefore, active failure prediction is particularly important, but the current failure prediction for the SSD is low in accuracy, and whether the SSD has a potential failure or not cannot be accurately determined.

Disclosure of Invention

The application provides a fault prediction method, a fault prediction device and fault prediction equipment for a hard disk, which are used for accurately predicting hard disk faults.

In a first aspect, an embodiment of the present application provides a method for predicting a failure of a hard disk, where the method is executed by a fault location device. In the method, the fault location device first obtains a first attribute value of the hard disk. The first attribute value may be an attribute value collected before the hard disk fails (that is, while the hard disk is still operable), and it indicates the operating state of a plurality of components in the hard disk. The fault location device then inputs the first attribute value into a fault prediction model and obtains the output value corresponding to the first attribute value, referred to as a first anomaly score. The fault prediction model is trained in advance based on second attribute values collected when the hard disk operates normally; that is, the model is not trained on attribute values collected when the hard disk has failed or is at risk of failure. A second attribute value indicates the operating state of a plurality of components when the hard disk operates normally. After obtaining the first anomaly score, the fault location device compares it with a threshold; when the first anomaly score is greater than the threshold, the hard disk is considered operable but at risk of a fault, and the fault may occur at some future time.

With this method, the fault prediction model is trained using attribute values of the hard disk during normal operation, which improves the accuracy of the model. The fault location device can then determine whether the hard disk has a fault risk by comparing the output value of the fault prediction model with the threshold. Because a more accurate model yields an output value that reflects the running state of the hard disk more accurately, whether the hard disk has a fault risk can be determined accurately.
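
The comparison step of the first aspect can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `assess_disk`, the labels it returns, and the `score_fn` callable standing in for the trained fault prediction model are all assumptions made for the example.

```python
from typing import Callable, Sequence


def assess_disk(
    first_attribute_value: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    threshold: float,
) -> str:
    """Return a risk label for one attribute value of a still-running disk."""
    # score_fn is a hypothetical stand-in for the trained fault prediction model.
    first_anomaly_score = score_fn(first_attribute_value)
    # Strictly greater than the threshold means "operable but at risk".
    if first_anomaly_score > threshold:
        return "fault_risk"
    return "normal"


# Toy stand-in model: the score is simply the sum of the state values.
print(assess_disk([0.1, 0.2, 0.1], sum, threshold=1.0))  # normal
print(assess_disk([0.9, 0.8, 0.7], sum, threshold=1.0))  # fault_risk
```

A score exactly equal to the threshold is treated as normal, matching the "greater than the threshold" wording of the text.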

In one possible design, when the failure prediction model is trained using the second attribute values, a training set may be constructed using the second attribute values, and then the failure prediction model is trained using an unsupervised learning manner based on the training set.

With unsupervised learning, the balance between sample classes in the training set does not need to be considered, and the accuracy of the fault prediction model can still be ensured, so that the fault risk of the hard disk can be predicted accurately afterwards.

In one possible design, after the second attribute value is input to the failure prediction model, an output value corresponding to the second attribute value may be obtained, and the output value corresponding to the second attribute value may be referred to as a second anomaly score. The threshold may be determined based on the plurality of second anomaly scores, e.g., the threshold may be not less than a maximum of the plurality of second anomaly scores, or the threshold may be equal to a quantile of the plurality of second anomaly scores.

In this way, the threshold determined from the second anomaly scores clearly separates the anomaly scores produced when the hard disk operates normally from those produced when the hard disk is at risk of failure, so the result obtained by comparing the first anomaly score with the threshold is more accurate.
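
The two threshold choices mentioned above (a value not less than the maximum of the second anomaly scores, or a quantile of them) might be computed as in the sketch below. The function names, the optional `margin` parameter, and the nearest-rank quantile definition are assumptions for illustration; the text does not specify which quantile definition is used.

```python
import math
from typing import Sequence


def threshold_from_max(second_scores: Sequence[float], margin: float = 0.0) -> float:
    """Threshold not less than the largest score seen on normal data."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    return max(second_scores) + margin


def threshold_from_quantile(second_scores: Sequence[float], q: float) -> float:
    """Nearest-rank quantile of the normal-data scores, with 0 < q <= 1."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(second_scores)
    rank = math.ceil(q * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


# Made-up second anomaly scores, for illustration only.
scores = [0.12, 0.08, 0.15, 0.10, 0.30, 0.11, 0.09, 0.13, 0.14, 0.07]
print(threshold_from_max(scores))
print(threshold_from_quantile(scores, 0.5))
```

The maximum-based threshold never flags a training sample, while a quantile below 1 tolerates a few outliers in the normal data at the cost of flagging them.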

In one possible design, the failure prediction model may be structured in a variety of ways; for example, the failure prediction model may include a variational auto-encoder (VAE) and a long short-term memory network (LSTM).

By combining the VAE and the LSTM, more second attribute values can be input into the fault prediction model at one time, and the model can learn the time-sequence dependency among the second attribute values during training.

In one possible design, the first attribute value includes a plurality of status values, one status value indicating an operational status of one component in the hard disk, and different status values may indicate operational statuses of different components in the hard disk.

By the method, the running states of different components are indicated through different state values, and the indication mode is clearer and simpler.

In one possible design, the first anomaly score is equal to the sum of the products of the state values in the first attribute value and their corresponding weights. Besides determining that the hard disk has a fault risk, the fault location device can use the first anomaly score to determine which component in the hard disk has the fault risk, thereby locating the cause of the fault risk. For example, the fault location device may determine, from the plurality of state values that make up the first anomaly score, the target state value with the largest weight, and then determine the fault risk cause from the target state value: the component indicated by the target state value has a fault risk.

In this way, in addition to ensuring the accuracy of fault prediction, the fault location device can determine which components in the hard disk are at risk of failure, give maintenance personnel guidance, and allow the fault risk of the hard disk to be mitigated in advance.
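
The locating step can be illustrated with a small sketch. It assumes the per-state-value weights are already available from the model; how the model exposes them is not described in the text. The function name, the component names, and the numbers are hypothetical.

```python
from typing import Dict, Tuple


def locate_risk_component(
    state_values: Dict[str, float], weights: Dict[str, float]
) -> Tuple[float, str]:
    """Return (anomaly score, component whose state value has the largest weight)."""
    # Anomaly score as described: sum of each state value times its weight.
    score = sum(state_values[name] * weights[name] for name in state_values)
    # Target state value: the one with the largest weight.
    target = max(state_values, key=lambda name: weights[name])
    return score, target


# Hypothetical component names and weights, for illustration only.
state_values = {"flash_chip": 0.8, "controller": 0.1, "firmware": 0.2}
weights = {"flash_chip": 0.7, "controller": 0.2, "firmware": 0.1}
score, component = locate_risk_component(state_values, weights)
print(component)  # flash_chip
```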

In a possible design, before inputting the first attribute value into the fault prediction model, the fault location device may preprocess the first attribute value, and then input the preprocessed first attribute value into the fault prediction model to obtain the first anomaly score. The preprocessing includes some or all of the following: screening and normalization.

Preprocessing the first attribute value makes it easier for the fault prediction model to process, so the first anomaly score is obtained quickly and fault prediction becomes more efficient.

In a possible design, when the first anomaly score is not greater than the threshold, the fault location device may also indicate that the hard disk is operating normally and promptly inform the user, which improves the user experience.

In a second aspect, the present application provides a fault location device having the functions of the method in the first aspect and in any possible design of the first aspect. These functions can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the above functions. In a possible design, the device includes an obtaining unit, a score determining unit, and a risk determining unit, which perform the corresponding functions in the method of the first aspect; for details, refer to the description of the method, which is not repeated here.

In a third aspect, the present application further provides a computing device, and for beneficial effects, reference may be made to the description of the first aspect and any one of possible designs of the first aspect, which is not described herein again. The structure of the computing device comprises a processor and a memory, and the processor is configured to perform corresponding functions in the first aspect and any one of the possible design methods of the first aspect. The memory is coupled to the processor and holds the necessary program instructions and data for the fault locating device. The structure of the computing device further includes a communication interface for communicating with other devices, such as receiving the first attribute value, or sending a failure cause of the hard disk.

In a fourth aspect, the present application also provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of the above-described aspects.

In a fifth aspect, the present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the above aspects.

In a sixth aspect, the present application further provides a computer chip, where the chip is connected to a memory, and the chip is used to read and execute a software program stored in the memory, and execute the method in the foregoing aspects.

Drawings

FIG. 1 is a block diagram of a system according to the present application;

FIG. 2 is a schematic diagram of a model training method provided herein;

FIG. 3 is a schematic diagram of a failure prediction method for a hard disk according to the present application;

FIG. 4 is a schematic diagram of a failure prediction method for SSD according to the present application;

FIG. 5 is a schematic structural diagram of a fault location device provided in the present application;

FIG. 6 is a schematic structural diagram of a computing device provided in the present application.

Detailed Description

The embodiment of the present application provides a failure prediction method that may be used to predict whether an operable data storage device is at risk of failure. In the embodiment of the present application, a hard disk (such as an SSD or another type of hard disk) is used as an example of the data storage device. It should be understood that the method is also applicable to data storage devices other than hard disks; the specific implementation is similar to the failure prediction described in the embodiments of the present application, and details are not repeated here.

Referring to FIG. 1, a schematic diagram of a system architecture applicable to the embodiment of the present application is shown. The system includes a data acquisition device 100, a model training device 200, and a fault location device 300.

The data collection device 100 may be connected to the storage system 400, and obtain attribute values (e.g., a first attribute value and a second attribute value) of the hard disk 410 in the storage system 400. In the embodiment of the present application, the attribute value of the hard disk 410 may indicate the operating status of the components in the hard disk 410. The type and number of components in the hard disk 410 are not limited herein, and the components include some or all of the following: magnetic head, disk, motor, circuit, controller, flash memory chip, firmware.

The attribute value of the hard disk 410 may indicate the operating state of a component in the hard disk 410 directly. For example, the operating state of a component may be indicated by its busy coefficient: a high busy coefficient indicates that the component is actively operating with high efficiency. As another example, the operating state of the storage area in the hard disk 410 may be indicated by the number of bad blocks in the disk: a small number of bad blocks indicates that the storage area operates normally with high storage efficiency. The attribute value may also indicate the operating state indirectly. For example, the operating state of components in the hard disk 410 may be indicated by the number of data errors in the hard disk 410, the storage state of the storage areas in the disks of the hard disk 410 may be described by the number of uncorrectable errors occurring in the hard disk 410, and the storage state of individual blocks may be described by the number of times the blocks have been programmed.

When the attribute value indicates an operation state of a plurality of components in the hard disk 410, the attribute value may include a plurality of state values, one state value indicating an operation state of one component.
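
One way to picture an attribute value that carries several state values is a small record type. The sketch below is only an illustration; the class name, field names, and state value names are assumptions and do not come from the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AttributeValue:
    """One sample of a disk's attribute value (all field names are hypothetical)."""
    disk_id: str
    collected_at: float             # collection time, e.g. a Unix timestamp
    state_values: Dict[str, float]  # component or counter name -> state value


sample = AttributeValue(
    disk_id="disk-0",
    collected_at=1_700_000_000.0,
    state_values={"uncorrectable_errors": 0.0, "new_bad_blocks": 2.0},
)
print(sorted(sample.state_values))
```

Keeping the state values in a name-keyed mapping makes it straightforward to relate each state value back to the component it describes.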

The present application does not limit the manner in which the data acquisition apparatus 100 acquires the attribute value, and for example, the data acquisition apparatus 100 may be connected to a management device in the storage system 400, and request the management device for the attribute value of the hard disk in the storage system 400.

For another example, the data acquisition apparatus 100 may be directly connected to a hard disk in the storage system 400 to acquire an attribute value of the hard disk. The attribute values in the hard disk 410 may be generated by the hard disk 410 itself, for example, a sensor may be mounted on each component of the hard disk 410, the sensor is configured to detect an operating state of the component, and the processing unit in the hard disk 410 may acquire the plurality of state values through the sensor mounted on each component, and send the plurality of state values as the attribute values to the data acquisition apparatus 100.

For example, the attribute value may be a Self-Monitoring, Analysis and Reporting Technology (SMART) attribute value. With SMART, the operating state of each component in the hard disk 410 can be monitored by invoking a detection instruction in the hard disk 410, and an attribute value is generated. The attribute value may be provided to the data collection device 100 by the hard disk 410. The number of storage systems 400 connected to the data acquisition device 100 and the number of hard disks 410 in each storage system 400 are not limited in the embodiments of the present application. FIG. 1 depicts two storage systems 400 and some of the hard disks 410 by way of example only.

After acquiring the attribute value of the hard disk 410, the data acquisition device 100 may send it to the model training device 200, which may train a fault prediction model based on the attribute value. After the model training device 200 completes the training, the fault prediction model may be configured on the fault location device 300. The data acquisition device 100 may also send the acquired attribute value of the hard disk 410 to the fault location device 300. The fault location device 300 may perform fault location according to the attribute value and the fault prediction model to determine whether the hard disk 410 has a fault risk and, when it does, may further determine the cause of the fault risk, that is, the component in the hard disk 410 that is at risk.

In order to distinguish between the attribute values sent by the data acquisition device 100 to the model training device 200 and the attribute values sent to the fault location device 300, in the embodiment of the present application, the attribute values sent by the data acquisition device 100 to the model training device 200 are referred to as second attribute values, and the attribute values sent by the data acquisition device 100 to the fault location device 300 are referred to as first attribute values.

The deployed positions of the data acquisition device 100, the model training device 200, and the fault location device 300 are not limited in the embodiments of the present application. Any one of the data acquisition apparatus 100, the model training apparatus 200, and the fault location apparatus 300 may be operated on a cloud computing device system (including at least one cloud computing device, such as a server, etc.), an edge computing device system (including at least one edge computing device, such as a server, a desktop, etc.), or various terminal computing devices, for example: notebook computers, personal desktop computers, and the like. For example, the model training apparatus 200 and the fault locating apparatus 300 may be deployed in a cloud computing device system or an edge computing device system, and the data acquisition apparatus 100 may be deployed on a terminal computing device near the storage system 400 or the hard disk 410. For another example, the data collection apparatus 100, the model training apparatus 200, and the fault location apparatus 300 may be respectively operated in three environments, namely, a cloud computing device system, an edge computing device system, or a terminal computing device.

The data acquisition device 100, the model training device 200, and the fault location device 300 may be independent hardware devices connected by communication paths. Alternatively, some or all of the three devices may be combined in one hardware device. For example, the model training device 200 and the fault location device 300 may be combined in one hardware device, which can both train the fault prediction model and locate faults of the hard disk 410. For another example, the data collection device 100, the model training device 200, and the fault location device 300 may all be combined into one hardware device, which collects attribute values, trains the fault prediction model, and locates faults in the hard disk 410.

The above-mentioned hardware device is not limited to a specific form in the embodiment of the present application, and may be a server, a service cluster, or a terminal computing device.

The failure prediction method for the hard disk 410 provided in the embodiment of the present application uses a failure prediction model. The training method of the failure prediction model is therefore described first, with reference to FIG. 2.

Step 201: the data acquisition apparatus 100 acquires a plurality of second attribute values of the hard disk 410, where the second attribute values are acquired when the hard disk 410 operates normally, that is, the second attribute values indicate the operating states of components in the hard disk 410 under the condition that the hard disk 410 operates normally.

That is, the second attribute values collected by the data collection device 100 are all the attribute values when the hard disk 410 is in normal operation, and are not the attribute values when the hard disk 410 is in failure. It should be noted that the normal operation of the hard disk 410 means that each component in the hard disk has no risk of failure, that is, the hard disk 410 operates when the hard disk has no risk of failure. In some possible scenarios, even if components in the hard disk are aged or slightly damaged within an allowable range, the hard disk 410 may still operate, and there is no risk of failure, that is, the hard disk 410 is not prone to failure, and in such a scenario, the hard disk 410 may be considered to operate normally.

The data collection device 100 may acquire a plurality of second attribute values of the hard disk 410. For example, at different time periods, the attribute value of the hard disk 410 is obtained, and the obtained attribute value at each time period is a second attribute value. In order to ensure the accuracy of the failure prediction model, the data acquisition apparatus 100 may acquire as many second attribute values as possible.

Here, the number of the hard disks 410 is not limited, and may be one or more. Also, the type of the hard disk 410 is not limited, and taking the hard disk as an SSD as an example, the data acquisition device 100 may obtain a plurality of second attribute values of SSDs of different models.

Step 202: the data acquisition device 100 transmits the acquired plurality of second attribute values of the hard disk 410 to the model training device 200.

When the data acquisition apparatus 100 executes step 202, the acquired plurality of second attribute values may be directly transmitted to the model training apparatus 200, or the plurality of second attribute values may be preprocessed and the preprocessed plurality of second attribute values may be transmitted to the model training apparatus 200.

There are many ways of preprocessing, two of which are listed below, and it should be understood that other preprocessing operations on the plurality of second attribute values are also applicable to the embodiments of the present application.

Mode 1: screening the plurality of second attribute values.

For example, the data collection device 100 may remove duplicate attribute values from the plurality of second attribute values. As another example, the data collection device 100 may filter the state values included in each second attribute value, keeping only the valid ones; for instance, it may keep the state value recording the number of uncorrectable errors, the state value recording the number of block programming errors, and the state value recording the number of newly added bad blocks.

Mode 2: normalizing the second attribute values.

For example, when the second attribute value is a single value, the second attribute value may be normalized to a value in the interval of 0 to 1. For another example, when the second attribute value includes a plurality of state values, the plurality of state values may be respectively normalized to a numerical value in the range of 0 to 1.

Through the preprocessing, the fault prediction model can be conveniently trained by utilizing the preprocessed second attribute value, and the training process is simplified.
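
The two preprocessing modes can be sketched together as below. The sketch assumes each attribute value is a mapping from state value names to numbers, uses min-max scaling as one possible way to map values into the interval 0 to 1 (the text does not name a specific normalization), and uses hypothetical key names for the three state values the text suggests keeping.

```python
from typing import Dict, Iterable, List, Sequence

Record = Dict[str, float]

# Hypothetical names for the state values the text suggests keeping.
KEPT_KEYS = ("uncorrectable_errors", "block_program_errors", "new_bad_blocks")


def screen(records: Iterable[Record], keys: Sequence[str] = KEPT_KEYS) -> List[Record]:
    """Drop exact duplicate records, then keep only the selected state values."""
    seen = set()
    result = []
    for record in records:
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            continue  # identical attribute value already kept
        seen.add(fingerprint)
        result.append({k: record[k] for k in keys})
    return result


def normalize(records: List[Record]) -> List[Record]:
    """Min-max scale each state value into [0, 1] across the records."""
    if not records:
        return []
    out = [dict(r) for r in records]
    for key in records[0]:
        column = [r[key] for r in records]
        low = min(column)
        span = max(column) - low
        for r in out:
            # A constant column carries no information; map it to 0.0.
            r[key] = (r[key] - low) / span if span else 0.0
    return out


raw = [
    {"uncorrectable_errors": 0, "block_program_errors": 1, "new_bad_blocks": 0, "temperature": 35},
    {"uncorrectable_errors": 0, "block_program_errors": 1, "new_bad_blocks": 0, "temperature": 35},
    {"uncorrectable_errors": 4, "block_program_errors": 3, "new_bad_blocks": 2, "temperature": 40},
]
clean = normalize(screen(raw))
print(len(clean))  # 2
```

In practice the minimum and maximum used at prediction time would normally be the ones computed on the training data, so that first and second attribute values are scaled consistently; that choice is an assumption of this sketch rather than something the text states.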

Step 203: after receiving the plurality of second attribute values, the model training apparatus 200 may construct a training set using the plurality of second attribute values and train the fault prediction model using the training set. If the received second attribute values have already been preprocessed, the model training apparatus 200 may construct the training set from them directly. If the received second attribute values have not been preprocessed, the model training apparatus 200 may either construct the training set from them directly, or first preprocess them (for the preprocessing, refer to the foregoing description) and then construct the training set from the preprocessed second attribute values.

In step 203, the model training apparatus 200 may train the failure prediction model in an unsupervised learning manner. The unsupervised learning means that the data (such as the second attribute value) in the training set is not provided with a label, and the structural characteristics of the data in the training set are learned to realize classification.

The model training apparatus 200 may input each second attribute value into the failure prediction model to obtain an output value corresponding to each second attribute value. In unsupervised learning, the output values corresponding to the second attribute values may be understood as a classification result of the plurality of second attribute values. Together, these output values represent the overall operating state of the hard disk 410 during normal operation. For example, if an attribute value is input to the failure prediction model and the resulting output value is the same as the output value corresponding to one of the second attribute values, or lies within the range formed by the minimum and maximum of the output values corresponding to the plurality of second attribute values, that attribute value indicates that the hard disk 410 is operating normally. For convenience of description, the output value of the failure prediction model is referred to as an anomaly score, and the output value corresponding to a second attribute value is referred to as a second anomaly score, which may indicate the operating state of the hard disk 410 under that second attribute value.

The embodiment of the present application does not limit the structure of the fault prediction model; any model that can be generated in an unsupervised learning manner and can be used for fault prediction is applicable to the embodiment of the present application.

For example, the failure prediction model may include a variational auto-encoder (VAE) and a long short-term memory network (LSTM). The VAE is a deep generative model comprising an encoding network and a decoding network. The VAE encodes the input (e.g., a second attribute value) into a random variable in the hidden space (the encoding process), and then uses the decoding network to restore that random variable to data close to or the same as the input (the decoding process). When trained on a training set constructed from the second attribute values, with the goal of reconstructing the input data as closely as possible, the VAE learns the data characteristics of attribute values under normal operation of the hard disk 410. As a result, the reconstruction error is small when the VAE is given an attribute value from a hard disk 410 in normal operation, and large when it is given an attribute value from a failing hard disk 410. The reconstruction error can be used as the anomaly score output by the VAE: the larger the reconstruction error, the higher the degree of failure of the hard disk 410; the smaller the reconstruction error, the lower the degree of failure. A threshold is determined from the set of second anomaly scores generated on the training set. When a new attribute value is detected, the hard disk 410 is determined to have a fault risk if the anomaly score of the attribute value is greater than the threshold; otherwise the hard disk 410 is operating normally.

In order to capture the time-sequence dependency of the attribute values of the hard disk 410 (that is, their order in time), the LSTM may be added before the inputs of the encoding network and the decoding network in the VAE when the fault prediction model is constructed, which enhances the representation capability of the model. Because the input window of the LSTM is adjustable, the number of second attribute values processed at one time can be changed by adjusting the input window. When a plurality of second attribute values are input at one time, the encoding network and the decoding network in the VAE can learn the time-sequence dependency among them, so that the second anomaly scores finally determined for the plurality of second attribute values show a gradual trend to a certain extent.

This is because a hard disk 410 usually transitions gradually from normal operation to failure rather than failing suddenly, so the anomaly score tends to change gradually as well.
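
The adjustable input window can be illustrated by how time-ordered attribute values might be grouped before being fed to a sequence model. The sketch below covers only the windowing step, not the VAE or the LSTM themselves; the function name and the `stride` parameter are assumptions for illustration.

```python
from typing import List, Sequence


def sliding_windows(
    series: Sequence[Sequence[float]], window: int, stride: int = 1
) -> List[List[Sequence[float]]]:
    """Group time-ordered attribute values into overlapping windows."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    return [
        list(series[start:start + window])
        for start in range(0, len(series) - window + 1, stride)
    ]


# Five time-ordered attribute values, each with two state values.
series = [[0.0, 0.1], [0.1, 0.1], [0.2, 0.2], [0.3, 0.2], [0.4, 0.3]]
windows = sliding_windows(series, window=3)
print(len(windows))  # 3
```

A larger `window` lets the model see more attribute values at once, at the cost of fewer training windows from the same series.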

It should be noted that the anomaly score output by the failure prediction model may be a weighted sum of the state values in the input attribute value (e.g., a second attribute value or a first attribute value). That is, each state value corresponds to a weight, and the anomaly score of the attribute value equals the sum of the products of each state value and its weight. The process by which the fault prediction model outputs the anomaly score can thus be regarded as determining the weight of each state value and summing the weighted state values.
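
Written as a formula, with symbols introduced here only for illustration (they do not appear in the original text): let an attribute value contain n state values x_1, ..., x_n, and let w_1, ..., w_n be the corresponding weights determined by the model. The anomaly score s is then

```latex
s = \sum_{i=1}^{n} w_i \, x_i
```

As a made-up numerical example, state values (0.8, 0.1, 0.2) with weights (0.7, 0.2, 0.1) give s = 0.56 + 0.02 + 0.02 = 0.60.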

Step 204: after the training of the fault prediction model is completed (e.g., the loss function of the fault prediction model converges), the model training device 200 may send the fault prediction model to the fault location device 300.

When the fault location device 300 and the model training device 200 are combined in one hardware device, the hardware device can perform fault location using the fault prediction model after training it.

According to the training process of the fault prediction model, the training set of the fault prediction model is formed by the attribute values of the hard disk 410 in normal operation, the training set is simple to construct, and the attribute values of the hard disk 410 in fault or fault risk are not required to be considered. The training process of the fault prediction model is simpler and more efficient, and the accuracy of the trained fault prediction model is higher.

By configuring the failure prediction model in the failure positioning device 300 in the above manner, the failure positioning device 300 can implement failure positioning by using the failure prediction model, and determine whether the hard disk has a failure risk. A method for predicting a failure of a hard disk 410 according to an embodiment of the present application is described below with reference to fig. 3, where the method includes:

Step 301: the data acquisition device 100 acquires a first attribute value of the hard disk 410, where the first attribute value may be an attribute value, acquired by the data acquisition device 100, of the hard disk 410 that is currently running. The first attribute value of the hard disk 410 may be an attribute value of the hard disk 410 in normal operation, or an attribute value of the hard disk 410 when there is a fault risk. The first attribute value may represent the current operating state of the hard disk 410, in which case the fault location device 300 can perform online (i.e., real-time) fault location. The first attribute value may also represent the operating state of the hard disk 410 over a period of time, in which case the fault location device 300 can perform offline fault location.

Step 302: the data acquisition device 100 sends the first attribute value to the fault location device 300. Before sending the first attribute value, the data acquisition device 100 may preprocess it; for the preprocessing, refer to the foregoing description, and details are not repeated here.

Step 303: after receiving the first attribute value, the fault location device 300 inputs the first attribute value into the fault prediction model to obtain an output value corresponding to the first attribute value. For convenience of description, this output value is referred to as the first anomaly score of the first attribute value. The first anomaly score characterizes the operating state of the hard disk 410 under the first attribute value.

The first attribute value received by the fault location device 300 may already be preprocessed, in which case the fault location device 300 may input it directly into the fault prediction model. The received first attribute value may also not be preprocessed. In that case, the fault location device 300 may either input it directly into the fault prediction model (corresponding to a scenario in which the second attribute values in the training set were not preprocessed when the fault prediction model was trained), or preprocess it after receipt and input the preprocessed first attribute value into the fault prediction model (corresponding to a scenario in which the second attribute values in the training set were preprocessed when the fault prediction model was trained). For the description of the preprocessing, refer to the foregoing content; details are not repeated here.

Step 304: the fault location device 300 compares the first anomaly score with a threshold. The threshold is determined according to the second anomaly scores: it may be not less than the maximum of the second anomaly scores obtained in the embodiment shown in fig. 2, or it may be a quantile of the plurality of second anomaly scores.
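The two ways of deriving the threshold mentioned in step 304 can be sketched as follows. The application does not say which quantile definition or safety margin is used, so the nearest-rank quantile, the `margin` parameter, and the function names below are assumptions made for illustration.

```python
import math
from typing import Sequence


def threshold_from_max(second_scores: Sequence[float], margin: float = 0.0) -> float:
    """Return a threshold not less than the largest second anomaly score."""
    if not second_scores:
        raise ValueError("at least one second anomaly score is required")
    if margin < 0:
        raise ValueError("margin must be non-negative")
    return max(second_scores) + margin


def threshold_from_quantile(second_scores: Sequence[float], q: float) -> float:
    """Return the q-quantile of the scores using the nearest-rank rule."""
    if not second_scores:
        raise ValueError("at least one second anomaly score is required")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(second_scores)
    rank = math.ceil(q * len(ordered))  # 1-based rank of the quantile
    return ordered[rank - 1]


# Hypothetical second anomaly scores obtained on normal-operation data.
second_scores = [0.1, 0.4, 0.2, 0.3]
max_threshold = threshold_from_max(second_scores)
median_threshold = threshold_from_quantile(second_scores, 0.5)
```

A threshold at or above the maximum treats every training sample as normal, whereas a quantile below 1 tolerates a small share of normal samples scoring above the threshold.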

Step 305: the fault locating device 300 determines that the hard disk 410 is at risk of a fault if it determines that the first anomaly score is greater than the threshold. Further, the fault location apparatus 300 may further determine a fault cause of the hard disk 410 according to the first anomaly score.

When the first anomaly score is not greater than the threshold, the hard disk 410 is operating normally. When the first anomaly score is greater than the threshold, the hard disk 410 has a fault risk, and the cause of the fault can then be determined.
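The comparison in steps 304 and 305 applies to a single current attribute value (online location) and to a recorded series of attribute values (offline location) alike. The sketch below assumes the fault prediction model can be called as a plain function returning a score; `toy_model`, the function names, and the sample data are placeholders and do not represent the VAE-LSTM model of the application.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def locate_online(model: Model, threshold: float, first_attribute_value: Sequence[float]) -> bool:
    """Return True if the current attribute value indicates a fault risk."""
    return model(first_attribute_value) > threshold


def locate_offline(model: Model, threshold: float, history: Sequence[Sequence[float]]) -> List[int]:
    """Return the indices of recorded attribute values that indicate a fault risk."""
    return [i for i, value in enumerate(history) if model(value) > threshold]


def toy_model(value: Sequence[float]) -> float:
    """Placeholder scoring function (assumption): sum of the state values."""
    return sum(value)


history = [[0.1, 0.2], [0.4, 0.9], [0.2, 0.2], [1.0, 0.5]]
risky_indices = locate_offline(toy_model, 1.0, history)
at_risk_now = locate_online(toy_model, 1.0, [0.5, 0.5])
```

A score exactly equal to the threshold is treated as normal operation, matching the "not greater than" condition in the text.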

As can be seen from the description of the second anomaly score in the embodiment shown in fig. 2, the first anomaly score is a weighted sum of a plurality of state values. The greater the weight of a state value, the more readily that state value pushes the first anomaly score above the threshold. The fault location device 300 can therefore select, according to the weights, some or all of the state values with larger weights. The components indicated by these state values may have a fault risk and thereby cause the hard disk 410 to have a fault risk; that is, the cause of the fault of the hard disk 410 is that the components indicated by these state values have a fault risk.

Taking a first attribute value that includes K state values as an example, the fault location device 300 may determine the state value with the largest weight, or the state values ranked in the top N after the weights are sorted in descending order (where K > N, and K and N are positive integers), and refer to the determined state value or state values as the target state value. N may be an empirical value. Further, the fault location device 300 may determine that the component described by the target state value has a fault risk and is the cause of the fault of the hard disk 410.
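Selecting the target state values by weight can be sketched as below. The component names and weights are hypothetical examples, and the function name `target_state_indices` is an assumption; only the rule of taking the top N of K weights, with K > N, comes from the text.

```python
from typing import List, Sequence


def target_state_indices(weights: Sequence[float], n: int = 1) -> List[int]:
    """Return the indices of the n largest weights, largest first."""
    k = len(weights)
    if not 0 < n < k:
        raise ValueError("n must be a positive integer smaller than K")
    order = sorted(range(k), key=lambda i: weights[i], reverse=True)
    return order[:n]


# Hypothetical state value names and weights for K = 4.
names = ["wear_leveling", "bad_block_count", "temperature", "power_on_hours"]
weights = [0.1, 0.6, 0.05, 0.25]
suspects = [names[i] for i in target_state_indices(weights, n=2)]
```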

As can be seen from the embodiment shown in fig. 3, the fault location device 300 utilizes a pre-trained fault prediction model when determining whether the hard disk 410 has a fault according to the first attribute value; since the fault prediction model is trained by using the attribute values of the hard disk 410 during normal operation, when a fault is predicted, the fault location device 300 can accurately determine whether the hard disk 410 has a fault risk according to the comparison between the output value of the fault prediction model and the threshold value.

In the embodiment of the present application, the fault prediction model may be updated. For example, the model training device 200 may obtain new second attribute values through the data acquisition device 100, continue training the fault prediction model, and deploy the updated fault prediction model to the fault location device 300 after the training is completed. The fault prediction model may also be updated online, that is, the model training device 200 may obtain new second attribute values while the fault location device 300 is running, continue training the fault prediction model, and deploy the updated fault prediction model to the fault location device 300 after the training is completed. This maintains the accuracy of the fault prediction model, so that the fault location device 300 can subsequently use the updated fault prediction model to accurately determine whether the hard disk 410 has a fault risk and thereby achieve accurate fault location.
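One way to let the fault location device keep scoring while a retrained model is swapped in is sketched below. The application does not describe the update mechanism, so the `ModelHolder` class, the use of a lock, and the choice to replace the threshold together with the model are all assumptions made for illustration.

```python
import threading
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


class ModelHolder:
    """Holds the current model and threshold and allows them to be replaced."""

    def __init__(self, model: Model, threshold: float) -> None:
        self._lock = threading.Lock()
        self._model = model
        self._threshold = threshold

    def update(self, model: Model, threshold: float) -> None:
        """Replace the model and threshold together (assumed update step)."""
        with self._lock:
            self._model, self._threshold = model, threshold

    def has_fault_risk(self, first_attribute_value: Sequence[float]) -> bool:
        """Score the attribute value with the model that is current right now."""
        with self._lock:
            model, threshold = self._model, self._threshold
        return model(first_attribute_value) > threshold


# Placeholder models (assumptions): sum before the update, maximum after it.
holder = ModelHolder(lambda value: sum(value), threshold=1.0)
before = holder.has_fault_risk([0.6, 0.6])
holder.update(lambda value: max(value), threshold=1.0)
after = holder.has_fault_risk([0.6, 0.6])
```

Reading the model and threshold under the lock ensures a scoring call never mixes an old model with a new threshold.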

Fig. 4 is a schematic diagram of a fault prediction method for the hard disk 410 according to an embodiment of the present application. In fig. 4, the data acquisition device 100 acquires SMART attribute values of an SSD, preprocesses them, and sends them to the model training device 200. The model training device 200 trains the VAE-LSTM model and determines a threshold. After the training is completed, the model training device 200 configures the VAE-LSTM model and the threshold into the fault location device 300. The fault location device 300 obtains the preprocessed SMART attribute values of the SSD from the data acquisition device 100, outputs an anomaly score by using the VAE-LSTM model, and determines, by comparing the anomaly score with the threshold, whether the SSD is operating normally or has a fault risk. If a fault risk exists, the fault location device 300 determines the cause of the fault of the hard disk 410 according to the anomaly score, thereby implementing fault location. The model training device 200 may also update the fault prediction model and configure the updated fault prediction model in the fault location device 300.

It should be noted that, in the embodiment of the present application, determining the cause of the fault of the hard disk 410 according to the first anomaly score in step 305 may also be implemented by a model, which for convenience of description is referred to as a fault analysis model. When the first anomaly score is greater than the threshold, the fault analysis model may obtain the weight of each state value in the first anomaly score, determine the target state value, and then determine the component with a fault. The fault analysis model may be merged with the fault prediction model, and the merged model is referred to as a fault location model. After completing the training of the fault prediction model, the model training device 200 may merge the fault analysis model into the fault prediction model to form the fault location model and configure it in the fault location device 300. In this way, the fault location device 300 may determine the cause of the fault of the hard disk 410 by using the fault location model.
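Merging the fault analysis model into the fault prediction model can be pictured as function composition. In this sketch the prediction model is assumed to return both the anomaly score and the per-state-value weights, and the analysis model is assumed to map weights to component names; these interfaces, `merge_models`, and the toy models are illustrative assumptions rather than the structure defined in the application.

```python
from typing import Callable, List, Sequence, Tuple

PredictionModel = Callable[[Sequence[float]], Tuple[float, Sequence[float]]]
AnalysisModel = Callable[[Sequence[float]], List[str]]
LocationModel = Callable[[Sequence[float]], List[str]]


def merge_models(predict: PredictionModel, analyze: AnalysisModel, threshold: float) -> LocationModel:
    """Combine prediction and analysis into a single fault location model."""

    def locate(first_attribute_value: Sequence[float]) -> List[str]:
        score, weights = predict(first_attribute_value)
        if score <= threshold:
            return []  # normal operation: no cause to report
        return analyze(weights)

    return locate


def toy_predict(values: Sequence[float]) -> Tuple[float, Sequence[float]]:
    """Placeholder prediction model with fixed weights (assumption)."""
    weights = [0.7, 0.3]
    return sum(v * w for v, w in zip(values, weights)), weights


def toy_analyze(weights: Sequence[float]) -> List[str]:
    """Placeholder analysis model: name the component with the largest weight."""
    names = ["flash_memory", "controller"]  # hypothetical component names
    return [names[max(range(len(weights)), key=lambda i: weights[i])]]


locate = merge_models(toy_predict, toy_analyze, threshold=1.0)
causes = locate([2.0, 1.0])
```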

Based on the same inventive concept as the method embodiment, an embodiment of the present application further provides a fault location device configured to perform the method performed by the fault location device in the foregoing method embodiment. For related features, refer to the method embodiment; details are not repeated here. As shown in fig. 5, the fault location device 500 includes an obtaining unit 501, a score determining unit 502, and a risk determining unit 503.

An obtaining unit 501 is configured to obtain a first attribute value of a hard disk, where the first attribute value is used to indicate an operating state of a plurality of components in the hard disk.

The score determining unit 502 is configured to input the first attribute value into a failure prediction model, and output a first anomaly score, where the first anomaly score is used to indicate an operating state of the hard disk under the first attribute value, and the failure prediction model is generated based on a second attribute value of the hard disk, where the second attribute value is used to indicate an operating state of the plurality of components when the hard disk operates normally. In the embodiment of the present application, the normal operation of the hard disk means that each component in the hard disk does not have a fault risk, or within an allowable range, each component in the hard disk is considered to be capable of operating normally, so that the fault risk of the hard disk is not caused.

The risk determining unit 503 is configured to determine that the hard disk has a fault risk when the first anomaly score is greater than the threshold.

As a possible implementation manner, when training the failure prediction model, a training set may be formed by using the second attribute values, and based on the training set, the failure prediction model may be trained in an unsupervised learning manner.

As a possible implementation, the output value obtained by inputting a second attribute value into the fault prediction model may be referred to as a second anomaly score, and the threshold used for comparison with the first anomaly score may be determined according to a plurality of second anomaly scores. For example, the threshold may be set to be not less than the maximum of the plurality of second anomaly scores; as another example, the threshold may be set equal to a quantile of the plurality of second anomaly scores.

As a possible implementation, the present application does not limit the structure of the fault prediction model, and for example, the fault prediction model may include VAE and LSTM.

As a possible implementation, the first attribute value includes a plurality of status values, and one status value is used to indicate an operation status of one component in the hard disk.

As a possible implementation, the first anomaly score is equal to the sum of the products of each state value in the first attribute value and its corresponding weight. The risk determining unit 503 may first determine, according to the first anomaly score, the state value with the largest corresponding weight among the plurality of state values, which for convenience of description is referred to as a target state value, and then determine a fault cause according to the target state value, where the fault cause indicates that the component indicated by the target state value has a fault.

As a possible embodiment, when the first attribute value is input to the failure prediction model and the first abnormal score is output, the score determination unit 502 may perform preprocessing on the first attribute value, and then input the preprocessed first attribute value to the failure prediction model and output the first abnormal score, where the preprocessing includes some or all of the following: screening and normalizing.
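The two preprocessing operations named above, screening and normalizing, can be sketched as follows. The details of the preprocessing are given elsewhere in the application, so the attribute names, the keep-list form of screening, and the use of min-max normalization with clipping are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def screen(raw: Dict[str, float], keep: Sequence[str]) -> List[float]:
    """Keep only the selected state values, in the order given by `keep`."""
    return [raw[name] for name in keep]


def min_max_normalize(values: Sequence[float], bounds: Sequence[Tuple[float, float]]) -> List[float]:
    """Scale each state value into [0, 1] using its (low, high) bounds."""
    normalized = []
    for value, (low, high) in zip(values, bounds):
        if high <= low:
            raise ValueError("each upper bound must exceed its lower bound")
        clipped = min(max(value, low), high)  # keep out-of-range readings in bounds
        normalized.append((clipped - low) / (high - low))
    return normalized


# Hypothetical raw attribute value; "vendor_reserved" is screened out.
raw = {"temperature": 50.0, "bad_block_count": 5.0, "vendor_reserved": 123.0}
keep = ["temperature", "bad_block_count"]
bounds = [(0.0, 100.0), (0.0, 20.0)]
preprocessed = min_max_normalize(screen(raw, keep), bounds)
```

Whatever preprocessing is chosen, the same steps would need to be applied to the second attribute values used for training so that the model sees inputs on the same scale.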

As a possible implementation, the risk determining unit 503 indicates that the hard disk is operating normally if the first anomaly score is not greater than the threshold.

The division of units in the embodiments of the present application is schematic and is merely a division by logical function; other divisions are possible in actual implementation. In addition, the functional units in the embodiments of the present application may be integrated into one processor, may each exist alone physically, or two or more units may be integrated into one module. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.

The integrated unit, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application essentially, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a terminal device (which may be a personal computer, a mobile phone, or a network device) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

In a simple embodiment, those skilled in the art will appreciate that the fault location device 300 in the embodiment shown in fig. 2 or fig. 3 may take the form shown in fig. 6.

The computing device 600 shown in fig. 6 includes at least one processor 601, memory 602, and optionally a communication interface 603.

The memory 602 may be a volatile memory, such as a random access memory. The memory 602 may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). Alternatively, the memory 602 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 602 may also be a combination of the above.

The specific connection medium between the processor 601 and the memory 602 is not limited in the embodiments of the present application.

The processor 601 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, an artificial intelligence chip, a chip on chip, etc. The general-purpose processor may be a microprocessor or any conventional processor. When communicating with other devices, the processor 601 may transmit data through the communication interface 603, for example to receive a first detection instruction or a second detection instruction.

When the fault location device takes the form shown in fig. 6, the processor 601 in fig. 6 may call the computer-executable instructions stored in the memory 602, so that the computing device performs the method performed by the fault location device 300 in any of the foregoing method embodiments.

Specifically, the functions and implementation processes of the obtaining unit, the score determining unit, and the risk determining unit in fig. 5 may all be implemented by the processor 601 in fig. 6 calling the computer-executable instructions stored in the memory 602. Alternatively, the functions and implementation processes of the score determining unit and the risk determining unit in fig. 5 may be implemented by the processor 601 in fig. 6 calling the computer-executable instructions stored in the memory 602, and the functions and implementation processes of the obtaining unit in fig. 5 may be implemented by the communication interface 603 in fig. 6.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications can be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A failure prediction method for a hard disk, the method comprising:

acquiring a first attribute value of a hard disk, wherein the first attribute value is used for indicating the running state of a plurality of components in the hard disk;

inputting the first attribute value into a fault prediction model, and outputting a first abnormal score, wherein the first abnormal score is used for indicating the running state of the hard disk under the first attribute value, the fault prediction model is generated based on second attribute values of the hard disk in a training mode, and the second attribute values are used for indicating the running states of a plurality of components when the hard disk runs normally;

and determining that the hard disk has a fault risk under the condition that the first abnormal score is larger than a threshold value.

2. The method of claim 1, wherein the fault prediction model is generated by training in an unsupervised learning manner based on a training set including the second attribute values.

3. The method of claim 2, wherein the threshold value is determined based on a plurality of second anomaly scores, the second anomaly scores being output values obtained by inputting the second attribute values into the fault prediction model.

4. The method according to any one of claims 1 to 3, wherein the fault prediction model comprises a variational autoencoder (VAE) and a long short-term memory (LSTM) network.

5. The method of any of claims 1 to 4, wherein the first attribute value comprises a plurality of status values, one of the status values being indicative of an operational status of a component in the hard disk.

6. The method of claim 5, wherein the first anomaly score is equal to a sum of products of each of the state values and a corresponding weight in the first attribute value, the method further comprising:

determining a corresponding target state value with the maximum weight from the plurality of state values according to the first abnormal score;

and determining the fault reason according to the target state value, wherein the fault reason indicates that the component indicated by the target state value has a fault risk.

7. The method according to any one of claims 1 to 6, wherein the inputting the first attribute value into a fault prediction model and outputting a first anomaly score comprises:

preprocessing the first attribute value, inputting the preprocessed first attribute value into the fault prediction model, and outputting the first abnormal score, wherein the preprocessing comprises part or all of the following steps: screening and normalizing.

8. The method of any of claims 1 to 7, further comprising:

and indicating the hard disk to normally operate under the condition that the first abnormal score is not larger than the threshold value.

9. A fault location device, comprising an acquisition unit, a score determining unit, and a risk determining unit;

the acquisition unit is used for acquiring a first attribute value of a hard disk, and the first attribute value is used for indicating the running state of a plurality of components in the hard disk;

the score determining unit is used for inputting the first attribute value into a failure prediction model and outputting a first abnormal score, wherein the first abnormal score is used for indicating the running state of the hard disk under the first attribute value, the failure prediction model is generated based on second attribute values of the hard disk in a training mode, and the second attribute values are used for indicating the running states of a plurality of components when the hard disk runs normally;

and the risk determining unit is used for determining that the hard disk has a failure risk under the condition that the first abnormal score is larger than a threshold value.

10. The apparatus of claim 9, wherein the fault prediction model is generated by training in an unsupervised learning manner based on a training set including the second attribute values.

11. The apparatus of claim 10, wherein the threshold value is determined based on a plurality of second anomaly scores, the second anomaly scores being output values obtained by inputting the second attribute values into the fault prediction model; the threshold is not less than a maximum of the plurality of second anomaly scores.

12. The apparatus of any of claims 9 to 11, wherein the fault prediction model comprises a variational autoencoder (VAE) and a long short-term memory (LSTM) network.

13. The apparatus of any of claims 9 to 12, wherein the first attribute value comprises a plurality of status values, one of the status values being indicative of an operational status of a component in the hard disk.

14. The apparatus of claim 13, wherein the first anomaly score is equal to a sum of products of each of the state values and a corresponding weight in the first attribute value, the risk determination unit further to:

according to the first abnormal score, determining a corresponding target state value with the maximum weight from the plurality of state values;

15. The apparatus according to any one of claims 9 to 14, wherein the score determining unit is configured to, when the first attribute value is input to the failure prediction model and the first anomaly score is output, specifically:

16. The apparatus according to any of claims 9 to 15, wherein the risk determining unit is further configured to:

17. A computing device, wherein the computing device comprises a processor and a memory;

the memory for storing computer program instructions;

the processor is configured to invoke the computer program instructions stored in the memory to perform the method of any of claims 1-8.

18. A computer-readable storage medium having computer-executable instructions stored thereon for causing a computer to perform the method of any one of claims 1 to 8.