WO2022166481A1 - Failure prediction method, apparatus, and device for hard disks - Google Patents

Failure prediction method, apparatus, and device for hard disks

Info

Publication number
WO2022166481A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute value
hard disk
prediction model
fault
value
Application number
PCT/CN2021/142559
Other languages
English (en)
French (fr)
Inventor
刘冬实
康炳南
纪晓峰
胡崝
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2022166481A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0409 Online test

Definitions

  • The present application relates to the field of storage technologies, and in particular, to a failure prediction method, apparatus, and device for hard disks.
  • solid state drive (SSD)
  • The present application provides a failure prediction method, apparatus, and device for a hard disk, so as to accurately predict hard disk failures.
  • In a first aspect, an embodiment of the present application provides a fault prediction method for a hard disk. The method is executed by a fault locating device.
  • The fault locating device first obtains a first attribute value of the hard disk. The first attribute value may be an attribute value collected before the hard disk fails (that is, while the hard disk can still operate), and it indicates the operating status of multiple components in the hard disk.
  • The fault locating device then inputs the first attribute value into the fault prediction model and obtains the output value corresponding to the first attribute value.
  • For ease of description, the output value corresponding to the first attribute value is called the first abnormal score.
  • The first abnormal score indicates the operating state of the hard disk under the first attribute value.
  • The fault prediction model is pre-trained on second attribute values of the hard disk collected during normal operation; that is, the model is not trained on attribute values collected when the hard disk has failed or is at risk of failure.
  • The second attribute values indicate the operating states of multiple components while the hard disk is running normally. After obtaining the first abnormal score, the fault locating device compares it with a threshold. When the first abnormal score is greater than the threshold, the hard disk is considered to be still operational but at risk of failure, that is, it may fail at some point in the future.
  • With the above method, the fault locating device uses a fault prediction model trained on attribute values of the hard disk in normal operation, which improves the accuracy of the model. The fault locating device then determines whether the hard disk is at risk of failure by comparing the output value of the fault prediction model with the threshold. Because the output of a more accurate model reflects the operating state of the hard disk more faithfully, the failure risk of the hard disk can be determined more accurately.
  • A training set can be constructed from the second attribute values, and the fault prediction model is then trained on that training set by an unsupervised learning method.
  • The unsupervised learning method does not require the training samples to be balanced, which helps preserve the accuracy of the fault prediction model so that the failure risk of the hard disk can be predicted accurately.
  • For each second attribute value, the fault prediction model yields a corresponding output value, which is called the second abnormal score.
  • The threshold may be determined according to a plurality of second abnormal scores. For example, the threshold may be not less than the maximum value among the plurality of second abnormal scores, or the threshold may be equal to a quantile of the plurality of second abnormal scores.
  • A threshold determined from the second abnormal scores separates the abnormal scores seen during normal operation from those seen when the hard disk is at risk of failure more clearly, so the result of comparing the first abnormal score with the threshold is more accurate.
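  • The threshold rule above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `threshold_from_scores`, the optional `margin` parameter, and the nearest-rank quantile rule are assumptions introduced here.

```python
import math


def threshold_from_scores(second_scores, quantile=None, margin=0.0):
    """Derive a decision threshold from second abnormal scores.

    second_scores: abnormal scores of attribute values collected in normal operation.
    quantile: if None, use the maximum score (plus margin); otherwise a value
              in [0, 1] selecting a nearest-rank quantile (an assumed rule).
    margin: hypothetical non-negative slack added on top of the maximum.
    """
    if not second_scores:
        raise ValueError("need at least one second abnormal score")
    if quantile is None:
        # "Not less than the maximum": the maximum itself, optionally raised.
        return max(second_scores) + margin
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must lie in [0, 1]")
    ordered = sorted(second_scores)
    # Nearest-rank quantile: smallest score with at least quantile*N scores <= it.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]
```

  • Choosing the maximum means no healthy training sample would have been flagged; choosing a lower quantile makes the check more sensitive at the cost of flagging some healthy samples.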
  • The fault prediction model may include a variational autoencoder (VAE) and a long short-term memory (LSTM) network.
  • Combining the VAE with the LSTM allows more second attribute values to be input into the fault prediction model at one time, and lets the model learn the time-series dependencies between the second attribute values during training.
  • the first attribute value includes a plurality of status values, one status value is used to indicate the running status of a component in the hard disk, and different status values may indicate the running status of different components in the hard disk.
  • the first abnormal score is equal to the sum of the products of each state value in the first attribute value and the corresponding weight.
  • The fault locating device can not only determine that the hard disk is at risk of failure but also, according to the first abnormal score, determine which component in the hard disk is at risk, thereby locating the cause of the failure risk.
  • For example, the fault locating device may determine, from the plurality of state values and according to the first abnormal score, the target state value with the largest corresponding weight, and then determine the cause of the failure risk according to the target state value: the component indicated by the target state value is at risk of failure.
  • In this way, the fault locating device can effectively locate the components at risk of failure in the hard disk and give guidance to maintenance personnel, so that the failure risk of the hard disk can be eliminated in advance.
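  • A minimal sketch of the weighted score and of locating the component at risk is given below. The state-value names and the weights are hypothetical; in the application the weights come from the fault prediction model, whereas here they are simply passed in.

```python
def abnormal_score(state_values, weights):
    """Sum of each state value multiplied by its corresponding weight."""
    return sum(weights[name] * value for name, value in state_values.items())


def locate_risky_component(state_values, weights):
    """Return the name of the state value that carries the largest weight."""
    return max(state_values, key=lambda name: weights[name])


# Hypothetical, already normalized state values and model-provided weights.
example_states = {
    "uncorrectable_errors": 0.9,
    "block_program_errors": 0.1,
    "new_bad_blocks": 0.3,
}
example_weights = {
    "uncorrectable_errors": 0.5,
    "block_program_errors": 0.2,
    "new_bad_blocks": 0.3,
}
example_score = abnormal_score(example_states, example_weights)
example_cause = locate_risky_component(example_states, example_weights)
```

  • With these illustrative numbers, the component behind `uncorrectable_errors` would be reported as the cause of the failure risk because its weight is the largest.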
  • When the fault locating device inputs the first attribute value into the fault prediction model to obtain the first abnormal score, it may first preprocess the first attribute value and then input the preprocessed first attribute value into the fault prediction model.
  • The preprocessing includes some or all of the following: screening and normalization.
  • After preprocessing, the fault prediction model can process the first attribute value more conveniently and obtain the first abnormal score quickly, which improves the efficiency of fault prediction.
  • When the hard disk is determined not to be at risk of failure, the fault locating device may further indicate that the hard disk is running normally and promptly inform the user, which improves user experience.
  • The present application provides a fault locating device that has the functions implemented in the first aspect and any possible design of the first aspect.
  • The functions of the device may be implemented by hardware, or by hardware executing corresponding software.
  • The hardware or software includes one or more units corresponding to the above functions.
  • The structure of the device includes an acquisition unit, a score determination unit, and a risk determination unit, which can perform the corresponding functions in the method example of the first aspect. For details, refer to the detailed description in the method example, which is not repeated here.
  • the present application further provides a computing device, and the beneficial effects can be found in the description of the first aspect and any possible design of the first aspect, and will not be repeated here.
  • the structure of the computing device includes a processor and a memory, and the processor is configured to perform the corresponding functions in the first aspect and any possible design method of the first aspect.
  • a memory is coupled to the processor and holds program instructions and data necessary for the fault locating device.
  • the structure of the computing device also includes a communication interface for communicating with other devices, such as receiving the first attribute value, or sending the failure cause of the hard disk.
  • The present application further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the methods of the above aspects.
  • the present application also provides a computer program product comprising instructions, which, when executed on a computer, cause the computer to perform the methods of the above aspects.
  • the present application further provides a computer chip, where the chip is connected to a memory, and the chip is used to read and execute a software program stored in the memory, and execute the methods of the above aspects.
  • FIG. 1 is a schematic diagram of the architecture of a system provided by the application.
  • FIG. 2 is a schematic diagram of a model training method provided by the application
  • FIG. 3 is a schematic diagram of a failure prediction method for a hard disk provided by the application.
  • FIG. 4 is a schematic diagram of a fault prediction method for SSD provided by the present application.
  • FIG. 5 is a schematic structural diagram of a fault location device provided by the application.
  • FIG. 6 is a schematic structural diagram of a computing device provided by the present application.
  • Embodiments of the present application provide a fault prediction method, which can be used to predict whether there is a risk of failure in a data storage device that can operate.
  • In the embodiments of the present application, a hard disk (such as an SSD or another type of hard disk) is used as an example of the data storage device. It should be understood that the method also applies to other devices that can be used for data storage.
  • The specific implementation for such devices is similar to the failure prediction method provided in the embodiments of the present application. For details, refer to the relevant description of the embodiments, which is not repeated here.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • The system includes a data collection apparatus 100, a model training apparatus 200, and a fault locating apparatus 300.
  • The data collection apparatus 100 may be connected to the storage system 400 to acquire attribute values (e.g., the first attribute value and the second attribute values) of the hard disk 410 in the storage system 400.
  • the attribute value of the hard disk 410 may indicate the running status of the components in the hard disk 410 .
  • the types and numbers of components in the hard disk 410 are not limited here, and the components include some or all of the following: magnetic heads, disks, motors, circuits, controllers, flash memory chips, and firmware.
  • The attribute value may indicate the running status of a component directly. For example, the operating status of a component in the hard disk 410 may be indicated by the busy factor of the component, where a high busy factor indicates that the component is actively running with high efficiency. As another example, the running status of the storage area in the hard disk 410 may be indicated by the number of bad blocks in the disk: when the number of bad blocks is small, the storage area in the disk is running normally and the storage efficiency is high.
  • An indirect indication may also be used, such as indicating the operating status of components in the hard disk 410 by the amount of data errors in the hard disk 410. For example, the number of uncorrectable errors occurring in the hard disk 410 describes the storage state of the storage area of the disk in the hard disk 410.
  • Similarly, the block programming count describes the storage state of each block of the disk in the hard disk 410.
  • When the attribute value indicates the running status of multiple components in the hard disk 410, the attribute value may include multiple status values, and one status value is used to indicate the running status of one component.
  • the present application does not limit the manner in which the data collection apparatus 100 obtains attribute values.
  • the data collection apparatus 100 may be connected to a management device in the storage system 400 and request the management device for the attribute value of the hard disk in the storage system 400 .
  • the data collection apparatus 100 may directly connect to the hard disk in the storage system 400 to obtain the attribute value of the hard disk.
  • The attribute values of the hard disk 410 may be generated by the hard disk 410 itself.
  • For example, a sensor may be installed on each component of the hard disk 410 to detect the running state of that component.
  • The processing unit in the hard disk 410 may acquire the plurality of state values through the sensors installed on the components and send the plurality of state values to the data collection apparatus 100 as an attribute value.
  • The attribute value may be a Self-Monitoring, Analysis and Reporting Technology (SMART) attribute value.
  • SMART is an automatic hard disk status detection and early warning system and specification.
  • The running status of each component in the hard disk 410 can be monitored by calling the detection instruction in the hard disk 410, and an attribute value can be generated.
  • The attribute value may be provided to the data collection apparatus 100 by the hard disk 410.
  • This embodiment of the present application does not limit the number of storage systems 400 connected to the data collection apparatus 100 or the number of hard disks 410 in the storage system 400. FIG. 1 shows only two storage systems 400 and some of the hard disks 410 by way of example.
  • The data collection apparatus 100 can send the acquired attribute values of the hard disk 410 to the model training apparatus 200, and the model training apparatus 200 can train the fault prediction model based on the attribute values of the hard disk 410.
  • After training, the fault prediction model may be configured on the fault locating apparatus 300.
  • The data collection apparatus 100 can also send the acquired attribute value of the hard disk 410 to the fault locating apparatus 300. The fault locating apparatus 300 performs fault locating according to the attribute value of the hard disk 410 and the fault prediction model: it determines whether the hard disk is at risk of failure and, if so, can further determine the cause, that is, which components in the hard disk 410 are at risk of failure.
  • For ease of distinction, the attribute values sent by the data collection apparatus 100 to the model training apparatus 200 are referred to as second attribute values, and the attribute value sent by the data collection apparatus 100 to the fault locating apparatus 300 is referred to as the first attribute value.
  • The embodiments of the present application do not limit where the data collection apparatus 100, the model training apparatus 200, and the fault locating apparatus 300 are deployed.
  • Any one of the data collection apparatus 100, the model training apparatus 200, and the fault locating apparatus 300 may run in a cloud computing device system (including at least one cloud computing device, such as a server), in an edge computing device system (including at least one edge computing device, such as a server or a desktop computer), or on various terminal computing devices, such as a notebook computer or a personal desktop computer.
  • For example, the model training apparatus 200 and the fault locating apparatus 300 can be deployed in a cloud computing device system or an edge computing device system, and the data collection apparatus 100 can be deployed on a terminal computing device near the storage system 400 or the hard disk 410.
  • The data collection apparatus 100, the model training apparatus 200, and the fault locating apparatus 300 may also each run in a different one of the three environments: cloud computing device system, edge computing device system, or terminal computing device.
  • The data collection apparatus 100, the model training apparatus 200, and the fault locating apparatus 300 may be independent hardware devices connected by a communication path. Some or all of them may also be combined in one hardware device. For example, the model training apparatus 200 and the fault locating apparatus 300 may be combined into one hardware device, which can both train the fault prediction model and perform fault location for the hard disk 410.
  • As another example, the data collection apparatus 100, the model training apparatus 200, and the fault locating apparatus 300 may be combined into one hardware device that provides attribute value collection, fault prediction model training, and fault location for the hard disk 410.
  • The embodiments of the present application do not limit the specific form of the above hardware device, which may be a server, a server cluster, or a terminal computing device.
  • The failure prediction method for the hard disk 410 provided by the embodiments of the present application uses a fault prediction model. The training method of the fault prediction model is described below with reference to FIG. 2.
  • Step 201: The data collection apparatus 100 obtains a plurality of second attribute values of the hard disk 410. The second attribute values are collected while the hard disk 410 is operating normally; that is, each second attribute value indicates the running status of the components in the hard disk 410 during normal operation.
  • The second attribute values collected by the data collection apparatus 100 are all attribute values from normal operation of the hard disk 410, not from a time when the hard disk 410 has failed.
  • Normal operation of the hard disk 410 means that no component in the hard disk is at risk of failure, that is, the hard disk 410 operates without any risk of failure.
  • If the hard disk 410 can still run and there is no risk of failure, that is, the hard disk 410 is not prone to failure, the hard disk 410 can also be considered to be operating normally.
  • The data collection apparatus 100 may acquire a plurality of second attribute values of the hard disk 410. For example, the attribute values of the hard disk 410 are acquired in different time periods, and the attribute value acquired in each time period is one second attribute value. To ensure the accuracy of the fault prediction model, the data collection apparatus 100 may acquire as many second attribute values as possible.
  • the number of hard disks 410 is not limited here, and may be one or more.
  • the type of the hard disk 410 is also not limited. Taking the hard disk as an SSD as an example, the data collection apparatus 100 can acquire multiple second attribute values of SSDs of different models.
  • Step 202: The data collection apparatus 100 sends the acquired second attribute values of the hard disk 410 to the model training apparatus 200.
  • When executing step 202, the data collection apparatus 100 can send the obtained second attribute values to the model training apparatus 200 directly, or it can first preprocess the second attribute values and then send the preprocessed second attribute values to the model training apparatus 200.
  • There are many ways of preprocessing, two of which are listed below. It should be understood that other preprocessing operations on the plurality of second attribute values are also applicable to the embodiments of the present application.
  • The first is screening. The data collection apparatus 100 may remove identical attribute values from the plurality of second attribute values.
  • The data collection apparatus 100 may also filter the state values included in each second attribute value, retaining only selected state values.
  • For example, the data collection apparatus 100 can retain the state values for the number of uncorrectable errors, the number of block programming errors, and the number of newly added bad blocks in the second attribute value.
  • The second is normalization. When the second attribute value is a single value, it may be normalized to a value in the interval from 0 to 1.
  • When the second attribute value includes multiple state values, each of the state values may be normalized to a value in the interval from 0 to 1.
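  • The two preprocessing operations can be sketched as below. This is an illustration under stated assumptions: the field names (including the extra `temperature` field in the usage note) are hypothetical, and min-max scaling over the collected samples is only one possible way to map values into the interval from 0 to 1, since the text does not name a normalization method.

```python
# Hypothetical names for the three state values retained in the example above.
KEPT_STATES = ("uncorrectable_errors", "block_program_errors", "new_bad_blocks")


def remove_duplicates(samples):
    """Drop attribute values identical to one already seen, keeping order."""
    seen, unique = set(), []
    for sample in samples:
        key = tuple(sorted(sample.items()))
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def keep_states(samples, kept=KEPT_STATES):
    """Retain only the selected state values of each attribute value."""
    return [{name: sample[name] for name in kept} for sample in samples]


def min_max_normalize(samples):
    """Scale every state value into [0, 1] using the min and max seen per state."""
    names = list(samples[0])
    lows = {n: min(s[n] for s in samples) for n in names}
    highs = {n: max(s[n] for s in samples) for n in names}
    normalized = []
    for sample in samples:
        row = {}
        for n in names:
            span = highs[n] - lows[n]
            # A state that never varies carries no information; map it to 0.0.
            row[n] = (sample[n] - lows[n]) / span if span else 0.0
        normalized.append(row)
    return normalized
```

  • A typical call chains the three steps, for example `min_max_normalize(keep_states(remove_duplicates(raw_samples)))`, where each raw sample is a dict such as `{"uncorrectable_errors": 4, "block_program_errors": 2, "new_bad_blocks": 3, "temperature": 45}`.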
  • Step 203: After receiving the plurality of second attribute values, the model training apparatus 200 may use them to construct a training set and use the training set to train the fault prediction model.
  • The second attribute values received by the model training apparatus 200 may already be preprocessed, in which case the training set is constructed from them directly.
  • The second attribute values received by the model training apparatus 200 may also be unpreprocessed, and the model training apparatus 200 may use them directly to construct a training set.
  • Alternatively, the model training apparatus 200 may preprocess the plurality of second attribute values after receiving them. For the description of the preprocessing, refer to the foregoing content, which is not repeated here.
  • The model training apparatus 200 may then construct a training set using the preprocessed second attribute values.
  • The model training apparatus 200 may train the fault prediction model by unsupervised learning.
  • Unsupervised learning means that the data in the training set (such as the second attribute values) is not labeled, and classification is achieved by learning the structural characteristics of the data itself.
  • The model training apparatus 200 may input each second attribute value into the fault prediction model and obtain an output value corresponding to each second attribute value.
  • The output values corresponding to the second attribute values can be understood as the classification result of the plurality of second attribute values.
  • The output values corresponding to the plurality of second attribute values represent the overall operating state of the hard disk 410 during normal operation. For example, when a certain attribute value is input into the fault prediction model and the resulting output value is the same as the output value corresponding to one of the second attribute values, or does not exceed the largest of the output values corresponding to the plurality of second attribute values, that attribute value indicates that the hard disk 410 is running normally.
  • For ease of description, the output value of the fault prediction model is called the abnormal score.
  • The output value corresponding to a second attribute value is called the second abnormal score, which indicates the operating state of the hard disk 410 under that second attribute value.
  • The embodiments of the present application do not limit the structure of the fault prediction model; any model that can be generated through unsupervised learning and can be used for fault prediction is applicable to the embodiments of the present application.
  • For example, the fault prediction model may include a variational autoencoder (VAE) and a long short-term memory (LSTM) network.
  • The VAE is a deep generative model that consists of two parts: an encoding network and a decoding network.
  • The VAE encodes the input (such as a second attribute value) into a random variable in the latent space (the encoding process), and then uses the decoding network to restore the random variable in the latent space to data close to or the same as the input (the decoding process).
  • The VAE is trained with the objective of maximizing the likelihood of the reconstructed data.
  • During the encoding-decoding process, the VAE learns the data features of the attribute values (e.g., the second attribute values) under normal operation of the hard disk 410.
  • These data features show up in the reconstruction error: the error is small when the VAE is given attribute values from normal operation of the hard disk 410 and relatively large when it is given attribute values from a faulty hard disk 410. The reconstruction error can be reflected in the VAE output.
  • An LSTM can be added before the input of the encoding network and of the decoding network in the VAE to enhance the representational power of the model. Because the input window of the LSTM can be adjusted, the number of second attribute values processed at one time can be changed by adjusting the input window. When multiple second attribute values are input at one time, the encoding network and decoding network in the VAE can learn the time-series dependencies among them, so that the second abnormal scores of the plurality of second attribute values show, to a certain extent, a gradual trend.
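  • The input-window idea and the reconstruction error can be illustrated without a neural network. The sketch below is not a VAE or an LSTM: it only shows how consecutive attribute vectors might be grouped into windows and how an error between a window and some reconstruction of it could be computed. The function names and the use of mean squared error are assumptions; the reconstruction itself would come from the trained model.

```python
def sliding_windows(series, window):
    """Group consecutive attribute vectors into overlapping input windows.

    series: list of attribute vectors ordered by collection time.
    window: number of attribute values fed to the model at one time.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [series[i:i + window] for i in range(len(series) - window + 1)]


def reconstruction_error(original, reconstructed):
    """Mean squared error between a window and its reconstruction."""
    total, count = 0.0, 0
    for row, rebuilt in zip(original, reconstructed):
        for x, y in zip(row, rebuilt):
            total += (x - y) ** 2
            count += 1
    return total / count
```

  • A larger `window` lets each model input cover a longer stretch of history, which is what allows temporal dependencies to be learned; a window reconstructed poorly yields a large error and therefore a high abnormal score.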
  • the abnormal score output by the fault prediction model may be the weighted sum of each state value in the input attribute values (eg, the second attribute value and the first attribute value). That is, each state value corresponds to a weight, and the sum of the products of each state value and the corresponding weight is equal to the abnormal score of the attribute value.
  • the process of outputting abnormal scores from the fault prediction model can be regarded as the process of determining the weight value of each state value and summing it up.
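  • Written as a formula, the weighted sum described above takes the following form. The symbols are notation introduced here for illustration: $x_i$ is the $i$-th state value of the input attribute value, $w_i$ is the weight determined for it by the fault prediction model, $n$ is the number of state values, and $s$ is the abnormal score.

```latex
s = \sum_{i=1}^{n} w_i \, x_i
```

  • As a purely illustrative calculation with made-up numbers, state values $x = (0.2,\ 0.5)$ and weights $w = (0.7,\ 0.3)$ give $s = 0.7 \cdot 0.2 + 0.3 \cdot 0.5 = 0.29$.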
  • Step 204: After the model training apparatus 200 completes the training of the fault prediction model (e.g., the loss function of the fault prediction model converges), the model training apparatus 200 may send the fault prediction model to the fault locating apparatus 300.
  • If the model training apparatus 200 and the fault locating apparatus 300 are combined into one hardware device, that hardware device can use the trained fault prediction model to perform fault location once training is complete.
  • The training set of the fault prediction model consists of attribute values of the hard disk 410 during normal operation, so constructing the training set is relatively simple. This also makes the training process of the fault prediction model simpler and more efficient, and the trained fault prediction model more accurate.
  • the fault prediction model is configured in the fault locating apparatus 300 in the above manner, and the fault locating apparatus 300 can use the fault prediction model to realize fault location and determine whether the hard disk has a fault risk.
  • The following describes the failure prediction method for the hard disk 410 provided by the embodiments of the present application with reference to FIG. 3. The method includes:
  • Step 301: The data collection apparatus 100 obtains a first attribute value of the hard disk 410. The first attribute value may be the attribute value of the hard disk 410 in its current operation, as collected by the data collection apparatus 100.
  • The first attribute value of the hard disk 410 may be an attribute value from normal operation of the hard disk 410, or an attribute value from a time when the hard disk 410 is at risk of failure.
  • The first attribute value may represent the current operating state of the hard disk 410.
  • In this case, the fault locating apparatus 300 can be used to locate faults online (i.e., in real time).
  • The first attribute value can also represent the running state of the hard disk 410 at a certain time in the past. In this case, the fault locating apparatus 300 can be used to locate faults offline.
  • Step 302 the data collection apparatus 100 sends the first attribute value to the fault location apparatus 300 .
  • the data collection apparatus 100 may preprocess the first attribute value. For the method of preprocessing, reference may be made to the foregoing description, which will not be repeated here.
  • Step 303 After receiving the first attribute value, the fault locating device 300 inputs the first attribute value into the fault prediction model, and obtains the output value corresponding to the first attribute value.
  • for ease of description, the output value of the first attribute value is referred to as the first abnormal score of the first attribute value.
  • the first abnormal score represents the operating state of the hard disk 410 under the first attribute value.
  • the first attribute value received by the fault locating apparatus 300 may be a preprocessed first attribute value, and the fault locating apparatus may directly input the first attribute value into the fault prediction model.
  • the first attribute value received by the fault locating apparatus 300 may also be an unpreprocessed first attribute value. In that case, the fault locating apparatus may directly input the first attribute value into the fault prediction model (corresponding to the scenario where the second attribute values in the training set were not preprocessed when the fault prediction model was trained).
  • the fault locating apparatus 300 may also preprocess the first attribute value after receiving it, and input the preprocessed first attribute value into the fault prediction model (corresponding to the scenario where the second attribute values in the training set were preprocessed when the fault prediction model was trained). For the preprocessing, reference may be made to the foregoing description, which will not be repeated here.
  • Step 304 The fault location device 300 compares the first abnormal score with a threshold. The threshold is determined according to the second abnormal scores: it may be not smaller than the maximum of the second abnormal scores obtained in the embodiment shown in FIG. 2 , or it may be a quantile of the multiple second abnormal scores.
  • Step 305 The fault locating apparatus 300 determines that the hard disk 410 has a fault risk when the first abnormal score is determined to be greater than the threshold. Further, the fault locating apparatus 300 may also determine the fault cause of the hard disk 410 according to the first abnormal score.
  • when the first abnormal score is not greater than the threshold, the hard disk 410 operates normally. When the first abnormal score is greater than the threshold, the hard disk 410 has a risk of failure, and the cause of the failure can be further determined.
  • the first abnormal score is a weighted sum of multiple state values.
  • the larger a weight, the more likely it is to cause the first abnormal score to exceed the threshold.
  • the fault location device 300 can determine, according to the size of the weights, some or all of the state values with larger weights.
  • the components indicated by these state values may be at risk of failure, resulting in the risk of failure of the hard disk 410; that is, the cause of the failure of the hard disk 410 is that the components indicated by these state values are at risk of failure.
  • taking a first attribute value that includes K state values as an example, the fault locating apparatus 300 can determine the state value with the largest weight, or the state values in the top N positions after the weights are sorted in descending order (where K > N, and K and N are positive integers). The determined state values are called target state values. N can be an empirical value. Further, the fault locating apparatus 300 can determine that the components described by the target state values are at risk of failure and are the cause of the failure of the hard disk 410 .
  • the fault locating device 300 uses a pre-trained fault prediction model when judging whether the hard disk 410 is faulty according to the first attribute value. Because the fault prediction model is trained with attribute values of the hard disk 410 during normal operation, the fault locating device 300 can, during fault prediction, accurately determine whether the hard disk 410 has a fault risk by comparing the output value of the fault prediction model with the threshold.
  • the model training device 200 can obtain new second attribute values through the data acquisition device 100, continue to train the fault prediction model, and, after the training is completed, update the fault prediction model into the fault location device 300 .
  • the fault prediction model can be updated in an online update manner, that is, the model training device 200 can obtain a new second attribute value during the operation of the fault location device 300, and continue to train the fault prediction model. Then, the fault prediction model is updated to the fault location device 300 . In this way, the accuracy of the fault prediction model can be ensured, so that the fault locating apparatus 300 can accurately determine whether the hard disk 410 has a fault risk by using the updated fault prediction model subsequently, so as to realize accurate fault location.
  • FIG. 4 is a schematic diagram of a failure prediction method for a hard disk 410 according to an embodiment of the present application.
  • the data collection device 100 collects the SMART attribute value of the SSD, and sends it to the model training device 200 after preprocessing.
  • the model training device 200 trains the VAE-LSTM model and determines the threshold.
  • the model training device 200 configures the VAE-LSTM model and the threshold into the fault location device 300. The fault location device 300 obtains the preprocessed SMART attribute value of the SSD from the data acquisition device 100, uses the VAE-LSTM model to output the abnormal score, and, by comparing the abnormal score with the threshold, determines whether the SSD is running normally or has a risk of failure.
  • when there is a risk of failure, the fault locating device 300 determines the cause of the failure of the hard disk 410 according to the abnormal score to realize fault location.
  • the model training apparatus 200 may also update the fault prediction model, and configure the updated fault prediction model into the fault location apparatus 300 .
  • step 305 (that is, determining the failure cause of the hard disk 410 according to the first abnormal score) can also be implemented by a model.
  • the model can be called a failure analysis model.
  • the fault analysis model can obtain the weight of each state value in the first abnormal score when the first abnormal score is greater than the threshold, determine the target state value, and then determine the component with fault.
  • the fault analysis model can be combined with the fault prediction model; the combined model is referred to here as a fault location model.
  • the model training device 200 can combine the fault analysis model into the fault prediction model to form a fault location model and configure it to the fault location device 300 . In this way, the fault locating apparatus 300 can use the fault locating model to determine the fault cause of the hard disk 410 .
  • the embodiments of the present application further provide a fault locating device, which is used to execute the method performed by the fault locating device in the above method embodiments.
  • the fault location device 500 includes an acquisition unit 501 , a score determination unit 502 , and a risk determination unit 503 .
  • the obtaining unit 501 is configured to obtain a first attribute value of the hard disk, where the first attribute value is used to indicate the running status of multiple components in the hard disk.
  • the score determination unit 502 is configured to input the first attribute value into the failure prediction model, and output the first abnormal score, the first abnormal score is used to indicate the operating state of the hard disk under the first attribute value, and the failure prediction model is based on The second attribute value of the hard disk is generated by training, and the second attribute value is used to indicate the operating status of the multiple components when the hard disk is running normally.
  • the normal operation of the hard disk means that each component in the hard disk has no risk of failure, or within the allowable range, it is considered that each component in the hard disk can operate normally without causing the risk of failure of the hard disk.
  • the risk determination unit 503 is configured to determine that the hard disk has a risk of failure when the first abnormal score is greater than the threshold.
  • a training set may be formed by using the second attribute value, and based on the training set, the fault prediction model is trained in an unsupervised learning manner.
  • the output value obtained by inputting the second attribute value into the fault prediction model may be referred to as the second abnormal score.
  • the threshold used for comparison with the first abnormal score may be determined based on the plurality of second abnormal scores. For example, the threshold may be set to be not less than the maximum value among the plurality of second abnormal scores, or the threshold may be set equal to a quantile of the plurality of second abnormal scores.
  • the present application does not limit the structure of the fault prediction model, for example, the fault prediction model may include VAE and LSTM.
  • the first attribute value includes a plurality of status values, and one status value is used to indicate the running status of a component in the hard disk.
  • the first abnormal score is equal to the sum of the products of each state value in the first attribute value and the corresponding weight.
  • the risk determination unit 503 may first determine, from the plurality of state values according to the first abnormal score, the state value with the largest corresponding weight, which is referred to as the target state value for convenience of explanation; after that, the fault cause is determined according to the target state value, and the fault cause indicates that the component indicated by the target state value is faulty.
  • when the score determination unit 502 inputs the first attribute value into the fault prediction model and outputs the first abnormal score, it may first preprocess the first attribute value, and then input the preprocessed first attribute value into the fault prediction model and output the first abnormal score.
  • the preprocessing includes part or all of the following: screening processing and normalization processing.
  • the risk determination unit 503 indicates that the hard disk is running normally when the first abnormal score is not greater than the threshold.
  • the division of units in the embodiments of the present application is schematic, and is only a logical function division. In actual implementation, there may be other division methods.
  • the functional units in the various embodiments of the present application may be integrated into one processor, may exist physically alone, or two or more units may be integrated into one module.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of software function modules.
  • the integrated unit if implemented in the form of software functional modules and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solutions of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a terminal device (which may be a personal computer, a mobile phone, or a network device, etc.) or a processor execute all or part of the steps of the method in each embodiment of the present application.
  • the aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk and other media that can store program codes .
  • in a simple embodiment, those skilled in the art can conceive that the fault locating device 300 in the embodiment shown in FIG. 2 or 3 can take the form shown in FIG. 6 .
  • the computing device 600 shown in FIG. 6 includes at least one processor 601 , a memory 602 , and optionally, a communication interface 603 .
  • the memory 602 can be a volatile memory, such as random access memory; the memory can also be a non-volatile memory, such as read-only memory, flash memory, hard disk drive (HDD) or solid-state drive (solid-state drive, SSD), physical disk, or memory 602 is, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Memory 602 may be a combination of the above-described memories.
  • connection medium between the above-mentioned processor 601 and the memory 602 is not limited in this embodiment of the present application.
  • the processor 601 can be a central processing unit (CPU), and the processor 601 can also be another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, an artificial intelligence chip, chips on a chip, etc.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • when the fault locating device takes the form shown in FIG. 6 , the processor 601 in FIG. 6 can call the computer-executable instructions stored in the memory 602 , so that the computing device executes the method performed by the fault location device 300 in any of the above method embodiments.
  • the functions/implementation processes of the acquiring unit, the score determining unit, and the risk determining unit in FIG. 5 can be implemented by the processor 601 in FIG. 6 calling the computer-executed instructions stored in the memory 602 .
  • alternatively, the function/implementation process of the score determination unit and the risk determination unit in FIG. 5 can be implemented by the processor 601 in FIG. 6 calling the computer execution instructions stored in the memory 602, and the function/implementation process of the acquisition unit in FIG. 5 may be implemented through the communication interface 603 in FIG. 6 .
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

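A fault prediction method, apparatus, and device for a hard disk. A fault locating apparatus first obtains a first attribute value of the hard disk and inputs it into a fault prediction model to obtain a first abnormal score, which indicates the operating state of the hard disk under the first attribute value. The fault prediction model is trained on second attribute values collected while the hard disk operates normally; a second attribute value indicates the operating states of multiple components of the hard disk during normal operation. The first abnormal score is then compared with a threshold. When the first abnormal score is greater than the threshold, the hard disk can still run but has a fault risk and may fail at some time in the future. Training the fault prediction model on attribute values from normal operation improves its accuracy, so its output reflects the operating state of the hard disk more accurately, and whether the hard disk has a fault risk can therefore be determined accurately.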

Description

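Fault prediction method, apparatus, and device for a hard disk
Technical Field
This application relates to the field of storage technologies, and in particular, to a fault prediction method, apparatus, and device for a hard disk.
Background
Currently, storage systems usually use solid state drives (SSDs) as the main data storage devices. The failure rate of the data storage devices affects the reliability of the storage system.
Therefore, possible faults in a data storage device need to be detected in time and their causes located, so that the faults can be corrected and handled promptly.
Proactive fault prediction is therefore particularly important. However, current fault prediction for SSDs has low accuracy and cannot accurately determine whether an SSD has a potential fault.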
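Summary
This application provides a fault prediction method, apparatus, and device for a hard disk, to accurately predict hard disk faults.
According to a first aspect, an embodiment of this application provides a fault prediction method for a hard disk. The method is performed by a fault locating apparatus. In the method, the fault locating apparatus may first obtain a first attribute value of the hard disk. The first attribute value may be an attribute value of the hard disk before it fails (that is, while the hard disk can still run), and is used to indicate the operating states of multiple components in the hard disk. After obtaining the first attribute value, the fault locating apparatus may input it into a fault prediction model to obtain the corresponding output value, which is referred to in the embodiments of this application as a first abnormal score. The first abnormal score may indicate the operating state of the hard disk under the first attribute value. The fault prediction model is trained in advance based on second attribute values of the hard disk during normal operation; that is, it is not trained on attribute values collected when the hard disk is faulty or has a fault risk. A second attribute value is used to indicate the operating states of multiple components when the hard disk operates normally. After obtaining the first abnormal score, the fault locating apparatus may compare it with a threshold. When the first abnormal score is greater than the threshold, the hard disk is considered to be still able to run but to have a fault risk, and it may fail at some time in the future.
With the foregoing method, the fault locating apparatus can train the fault prediction model on attribute values of the hard disk during normal operation, which improves the accuracy of the model. The fault locating apparatus can then determine whether the hard disk has a fault risk by comparing the output value of the fault prediction model with the threshold. The output of a more accurate fault prediction model reflects the operating state of the hard disk more accurately, so whether the hard disk has a fault risk can be determined accurately.
In a possible design, when the fault prediction model is trained with the second attribute values, a training set may be built from the second attribute values, and the fault prediction model is then trained on the training set in an unsupervised learning manner.
With the foregoing method, unsupervised learning removes the need to consider the balance of training samples and ensures the accuracy of the fault prediction model, so that the fault risk of the hard disk can subsequently be predicted precisely.
In a possible design, after a second attribute value is input into the fault prediction model, the corresponding output value is obtained; this output value may be referred to as a second abnormal score. The threshold may be determined according to multiple second abnormal scores. For example, the threshold may be not less than the maximum of the multiple second abnormal scores, or the threshold may be equal to a quantile of the multiple second abnormal scores.
With the foregoing method, a threshold determined from the second abnormal scores clearly separates the abnormal scores of a normally operating hard disk from those of a hard disk with a fault risk, so the result obtained by comparing the first abnormal score with the threshold is more accurate.
In a possible design, the fault prediction model may have various structures. For example, the fault prediction model may include a VAE and an LSTM.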
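With the foregoing method, combining the VAE and the LSTM allows more second attribute values to be input into the fault prediction model at a time, and the model can learn the temporal dependencies among the second attribute values during training.
In a possible design, the first attribute value includes multiple state values. One state value is used to indicate the operating state of one component in the hard disk, and different state values may indicate the operating states of different components.
With the foregoing method, different state values indicate the operating states of different components, which makes the indication clearer and simpler.
In a possible design, the first abnormal score is equal to the sum of the products of each state value in the first attribute value and its corresponding weight. Besides determining that the hard disk has a fault risk, the fault locating apparatus can also determine, according to the first abnormal score, the component in the hard disk that has a fault risk, thereby locating the cause of the fault risk. For example, the fault locating apparatus may determine, according to the first abnormal score, the target state value with the largest corresponding weight among the multiple state values, and then determine the cause of the fault risk according to the target state value: the component indicated by the target state value has a fault risk.
With the foregoing method, in addition to ensuring the accuracy of fault prediction, the fault locating apparatus can determine the component that has a fault risk. This effectively locates the at-risk component in the hard disk and gives maintenance personnel guidance, so that the fault risk of the hard disk can be removed in advance.
In a possible design, when inputting the first attribute value into the fault prediction model to output the first abnormal score, the fault locating apparatus may first preprocess the first attribute value and then input the preprocessed first attribute value into the fault prediction model to output the first abnormal score. The preprocessing includes some or all of the following: screening and normalization.
With the foregoing method, preprocessing the first attribute value makes it easier for the fault prediction model to process the value and obtain the first abnormal score quickly, which improves the efficiency of fault prediction.
In a possible design, when the first abnormal score is not greater than the threshold, the fault locating apparatus may further indicate that the hard disk operates normally, to inform the user in time that the hard disk is normal and improve user experience.
According to a second aspect, this application provides a fault locating apparatus that has the functions implemented in the first aspect and any possible design of the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the foregoing functions. In a possible design, the structure of the apparatus includes an obtaining unit, a score determination unit, and a risk determination unit. These units may perform the corresponding functions in the method examples of the first aspect. For details, refer to the descriptions in the method examples; details are not repeated here.
According to a third aspect, this application further provides a computing device. For its beneficial effects, refer to the descriptions of the first aspect and any possible design of the first aspect; details are not repeated here. The structure of the computing device includes a processor and a memory. The processor is configured to perform the corresponding functions in the method of the first aspect and any possible design of the first aspect. The memory is coupled to the processor and stores the program instructions and data necessary for the fault locating apparatus. The structure of the computing device further includes a communication interface, configured to communicate with other devices, for example, to receive the first attribute value or to send the fault cause of the hard disk.
According to a fourth aspect, this application further provides a computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the methods of the foregoing aspects.
According to a fifth aspect, this application further provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the methods of the foregoing aspects.
According to a sixth aspect, this application further provides a computer chip. The chip is connected to a memory and is configured to read and execute a software program stored in the memory, to perform the methods of the foregoing aspects.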
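Brief Description of Drawings
FIG. 1 is a schematic architectural diagram of a system according to this application;
FIG. 2 is a schematic diagram of a model training method according to this application;
FIG. 3 is a schematic diagram of a fault prediction method for a hard disk according to this application;
FIG. 4 is a schematic diagram of a fault prediction method for an SSD according to this application;
FIG. 5 is a schematic structural diagram of a fault locating apparatus according to this application;
FIG. 6 is a schematic structural diagram of a computing device according to this application.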
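Detailed Description
An embodiment of this application provides a fault prediction method that can be used to predict whether a data storage device that is still able to run has a fault risk. In the embodiments of this application, the data storage device is a hard disk (such as an SSD or another type of hard disk) by way of example. It should be understood that the method also applies to data storage devices other than hard disks; the specific implementation is similar to the fault prediction manner provided in the embodiments of this application. For details, refer to the related descriptions in the embodiments; details are not repeated here.
The following describes a system architecture to which the embodiments apply. FIG. 1 is a schematic diagram of a system architecture according to an embodiment of this application. The system includes a data collection apparatus 100, a model training apparatus 200, and a fault locating apparatus 300.
The data collection apparatus 100 may be connected to a storage system 400 and obtain attribute values (such as first attribute values and second attribute values) of a hard disk 410 in the storage system 400. In the embodiments, an attribute value of the hard disk 410 may indicate the operating states of components in the hard disk 410. The type and number of components in the hard disk 410 are not limited here. The components include some or all of the following: a magnetic head, a platter, a motor, a circuit, a controller, a flash memory chip, and firmware.
When indicating the operating state of a component, an attribute value of the hard disk 410 may do so directly. For example, a busy coefficient of a component indicates the operating state of that component: a high busy coefficient means that the component is running actively and efficiently. As another example, the number of bad blocks on a platter indicates the operating state of the storage area of the hard disk 410: a small number of bad blocks means that the storage area on the platter operates normally and stores data efficiently. An attribute value may also indicate the operating state indirectly. For example, the amount of erroneous data in the hard disk 410 indicates the operating state of its components; the number of uncorrectable errors that have occurred in the hard disk 410 describes the storage state of the storage area on the platter; and the block program count describes the storage state of each block on the platter.
When an attribute value indicates the operating states of multiple components in the hard disk 410, the attribute value may include multiple state values, each of which indicates the operating state of one component.
This application does not limit how the data collection apparatus 100 obtains attribute values. For example, the data collection apparatus 100 may be connected to a management device in the storage system 400 and request the attribute values of the hard disks in the storage system 400 from the management device.
As another example, the data collection apparatus 100 may be connected directly to a hard disk in the storage system 400 and obtain its attribute values. The attribute values may be generated by the hard disk 410 itself. For example, a sensor may be installed on each component of the hard disk 410 to detect the operating state of that component. A processing unit in the hard disk 410 may obtain the multiple state values through the sensors installed on the components and send them to the data collection apparatus 100 as an attribute value.
For example, the attribute value may be a self-monitoring, analysis and reporting technology (SMART) attribute value. SMART is an automatic hard disk status detection and early-warning system and specification. With SMART, the operating states of the components of the hard disk 410 can be monitored by invoking detection instructions in the hard disk 410, and attribute values are generated. The attribute values may be provided by the hard disk 410 to the data collection apparatus 100. The embodiments of this application limit neither the number of storage systems 400 connected to the data collection apparatus 100 nor the number of hard disks 410 in a storage system 400. FIG. 1 shows only two storage systems 400 and some hard disks 410 by way of example.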
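After obtaining attribute values of the hard disk 410, the data collection apparatus 100 may send them to the model training apparatus 200. The model training apparatus 200 may train the fault prediction model based on these attribute values and, after training is complete, configure the fault prediction model on the fault locating apparatus 300. The data collection apparatus 100 may also send obtained attribute values of the hard disk 410 to the fault locating apparatus 300. The fault locating apparatus 300 may perform fault locating according to these attribute values and the fault prediction model, that is, determine whether the hard disk has a fault risk and, if so, further determine the fault cause of the hard disk 410 to identify the component that has a fault risk.
To distinguish the attribute values that the data collection apparatus 100 sends to the model training apparatus 200 from those it sends to the fault locating apparatus 300, the embodiments of this application refer to the former as second attribute values and to the latter as first attribute values.
The embodiments of this application do not limit where the data collection apparatus 100, the model training apparatus 200, and the fault locating apparatus 300 are deployed. Any of the three apparatuses may run in a cloud computing device system (including at least one cloud computing device, such as a server), in an edge computing device system (including at least one edge computing device, such as a server or a desktop computer), or on various terminal computing devices, such as a laptop or a personal desktop computer. For example, the model training apparatus 200 and the fault locating apparatus 300 may be deployed in a cloud computing device system or an edge computing device system, and the data collection apparatus 100 may be deployed on a terminal computing device close to the storage system 400 or the hard disk 410. As another example, the data collection apparatus 100, the model training apparatus 200, and the fault locating apparatus 300 may run separately in the three environments: the cloud computing device system, the edge computing device system, and the terminal computing device.
The three apparatuses may be independent hardware apparatuses connected through communication paths. Some or all of them may also be combined in one hardware apparatus. For example, the model training apparatus 200 and the fault locating apparatus 300 may be combined into one hardware apparatus that can both train the fault prediction model and locate faults of the hard disk 410. As another example, all three may be combined into one hardware apparatus that collects attribute values, trains the fault prediction model, and locates faults of the hard disk 410.
The embodiments of this application do not limit the specific form of the hardware apparatus mentioned above. It may be a server, a server cluster, or a terminal computing device.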
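The fault prediction method for the hard disk 410 provided in the embodiments relies on a fault prediction model. The following first describes the training method of the fault prediction model with reference to FIG. 2.
Step 201: The data collection apparatus 100 obtains multiple second attribute values of the hard disk 410. The second attribute values are collected while the hard disk 410 operates normally; that is, a second attribute value indicates the operating states of the components in the hard disk 410 when the hard disk 410 operates normally.
In other words, all second attribute values collected by the data collection apparatus 100 are attribute values of the hard disk 410 during normal operation, not during a fault. It should be noted that normal operation of the hard disk 410 means that no component in the hard disk has a fault risk, that is, the hard disk 410 runs without a fault risk. In some possible scenarios, even if a component in the hard disk is aged or slightly damaged within an allowable range, the hard disk 410 can still run without a fault risk, that is, the hard disk 410 is unlikely to fail. In such scenarios, the hard disk 410 may also be considered to operate normally.
The data collection apparatus 100 may obtain multiple second attribute values of the hard disk 410. For example, it obtains attribute values of the hard disk 410 in different time periods, and the attribute value obtained in each time period is one second attribute value. To ensure the accuracy of the fault prediction model, the data collection apparatus 100 may obtain as many second attribute values as possible.
It should be noted that the number of hard disks 410 is not limited here; there may be one or more. The type of the hard disk 410 is not limited either. Taking SSDs as an example, the data collection apparatus 100 may obtain multiple second attribute values of SSDs of different models.
Step 202: The data collection apparatus 100 sends the obtained second attribute values of the hard disk 410 to the model training apparatus 200.
When performing step 202, the data collection apparatus 100 may send the obtained second attribute values directly to the model training apparatus 200, or may first preprocess them and send the preprocessed second attribute values to the model training apparatus 200.
There are many preprocessing manners. Two are listed below. It should be understood that other preprocessing operations on the second attribute values are also applicable to the embodiments of this application.
Manner 1: Screen the second attribute values.
For example, the data collection apparatus 100 may remove identical attribute values from the multiple second attribute values. As another example, the data collection apparatus 100 may screen the state values included in each second attribute value, such as selecting the valid state values. For instance, the data collection apparatus 100 may keep the state value that records the number of uncorrectable errors, the state value that records the number of block program errors, and the state value that records the number of newly added bad blocks.
Manner 2: Normalize the second attribute values.
For example, when a second attribute value is a single number, it may be normalized to a number in the range 0 to 1. As another example, when a second attribute value includes multiple state values, each state value may be normalized to a number in the range 0 to 1.
Preprocessing makes it easier to train the fault prediction model with the preprocessed second attribute values and simplifies the training procedure.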
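The two preprocessing manners can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the application: the field names (`uncorrectable_errors`, `program_errors`, `new_bad_blocks`, `temperature`) and the helper names `screen` and `normalize` are hypothetical, duplicates are removed after the state values are selected, and min-max scaling is assumed as the normalization because the text only requires values in the range 0 to 1.

```python
def screen(samples, keep):
    """Manner 1: keep only the selected state values and drop duplicate rows."""
    seen = set()
    rows = []
    for sample in samples:
        row = tuple(sample[name] for name in keep)
        if row not in seen:  # remove identical attribute values
            seen.add(row)
            rows.append(row)
    return rows


def normalize(rows):
    """Manner 2: min-max scale each state value (column) into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical second attribute values, one dict per collection period.
samples = [
    {"uncorrectable_errors": 0, "program_errors": 2, "new_bad_blocks": 1, "temperature": 35},
    {"uncorrectable_errors": 0, "program_errors": 2, "new_bad_blocks": 1, "temperature": 35},
    {"uncorrectable_errors": 4, "program_errors": 2, "new_bad_blocks": 3, "temperature": 40},
    {"uncorrectable_errors": 2, "program_errors": 2, "new_bad_blocks": 2, "temperature": 36},
]
keep = ("uncorrectable_errors", "program_errors", "new_bad_blocks")
screened = screen(samples, keep)
normalized = normalize(screened)
```

In practice the per-column minimum and maximum would likely be computed once on the training set and reused for first attribute values at prediction time, so that both are scaled identically.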
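Step 203: After receiving the second attribute values, the model training apparatus 200 may build a training set from them and train the fault prediction model with the training set. The second attribute values received by the model training apparatus 200 may already be preprocessed, in which case the training set is built from them. The received second attribute values may also be unpreprocessed; the model training apparatus 200 may build the training set directly from them, or may first preprocess them (for the preprocessing, refer to the foregoing description; details are not repeated here) and build the training set from the preprocessed second attribute values.
When performing step 203, the model training apparatus 200 may train the fault prediction model in an unsupervised learning manner. In unsupervised learning, the data in the training set (such as the second attribute values) carries no labels; the structural characteristics of the data itself are learned to achieve classification.
The model training apparatus 200 may input each second attribute value into the fault prediction model and obtain the corresponding output value. In unsupervised learning, the output values corresponding to the second attribute values can be understood as the classification result for those values. These output values characterize the overall operating state of the hard disk 410 during normal operation. For example, when an attribute value is input into the fault prediction model and its output value equals the output value of one of the second attribute values, or falls within the range bounded by the minimum and maximum output values of the second attribute values, that attribute value indicates that the hard disk 410 operates normally. For ease of description, an output value of the fault prediction model is referred to as an abnormal score, and the output value corresponding to a second attribute value is referred to as a second abnormal score, which may indicate the operating state of the hard disk 410 under that second attribute value.
The embodiments of this application do not limit the structure of the fault prediction model. Any model that can be generated through unsupervised learning and used for fault prediction is applicable to the embodiments of this application.
For example, the fault prediction model may include a variational autoencoder (VAE) and a long short-term memory (LSTM) network. A VAE is a deep generative model that consists of an encoding network and a decoding network. The VAE encodes an input (such as a second attribute value) into a random variable in a latent space (the encoding process) and then uses the decoding network to restore the latent random variable into data that is close or identical to the input (the decoding process). The VAE is trained with the goal of reconstructing the data as faithfully as possible. When trained on the training set built from second attribute values, it learns, during encoding and decoding, the data characteristics of attribute values (such as the second attribute values) of the hard disk 410 during normal operation. As a result, the reconstruction error is small for attribute values from normal operation and large for attribute values from a faulty hard disk 410. The reconstruction error can be expressed as the abnormal score output by the VAE: the larger the reconstruction error, the more severe the fault of the hard disk 410, and the smaller the reconstruction error, the less severe the fault. A threshold is determined from the set of second abnormal scores generated from the training set. When a new attribute value is checked, if its abnormal score is greater than the threshold, a fault risk is determined to exist; otherwise the hard disk 410 operates normally. To capture the temporal dependencies of the attribute values of the hard disk 410 (that is, their order in time), an LSTM may be added before the input of both the encoding network and the decoding network of the VAE when the fault prediction model is built, which strengthens the representation capability of the model. Because the input window of the LSTM is adjustable, the number of second attribute values processed at a time can be changed by adjusting the window. When multiple second attribute values are input at a time, the encoding and decoding networks of the VAE can learn the temporal dependencies among them, so that the second abnormal scores finally determined for them show, to some extent, a gradual trend.
This is because a fault of the hard disk 410 develops gradually from normal operation and usually does not occur suddenly, so the abnormal scores also show a gradual trend.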
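The adjustable input window mentioned above can be illustrated with a small sketch that groups time-ordered second attribute values into overlapping windows before they are fed to a sequence model. The function name `sliding_windows` and the `step` parameter are assumptions made for illustration; the application does not specify how windows are formed.

```python
def sliding_windows(sequence, window, step=1):
    """Split a time-ordered list of attribute vectors into overlapping windows.

    Each window holds `window` consecutive attribute values, so a sequence
    model such as an LSTM sees them in their original temporal order.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        sequence[start:start + window]
        for start in range(0, len(sequence) - window + 1, step)
    ]


# Four normalized second attribute values collected in successive periods.
history = [[0.1], [0.2], [0.3], [0.4]]
windows = sliding_windows(history, window=3)
```

A larger `window` lets the model see more attribute values at once, at the cost of fewer training windows from the same history.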
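It should be noted that the abnormal score output by the fault prediction model may be a weighted sum of the state values in the input attribute value (such as a second attribute value or a first attribute value). That is, each state value corresponds to a weight, and the sum of the products of the state values and their weights equals the abnormal score of the attribute value. The process by which the fault prediction model outputs an abnormal score can be regarded as determining the weight of each state value and computing the sum.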
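Written as a formula, with notation introduced here only for illustration (the symbols $s$, $x_k$, $w_k$, and $\tau$ do not appear in the application): for an attribute value $x$ with $K$ state values $x_1, \dots, x_K$, corresponding weights $w_1, \dots, w_K$, and a threshold $\tau$,

```latex
s(x) = \sum_{k=1}^{K} w_k \, x_k, \qquad \text{fault risk} \iff s(x) > \tau
```

The state values with the largest weights $w_k$ contribute most to pushing $s(x)$ above $\tau$, which is what the later fault-cause step relies on.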
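Step 204: After training of the fault prediction model is complete (for example, the loss function of the model has converged), the model training apparatus 200 may send the fault prediction model to the fault locating apparatus 300.
When the fault locating apparatus 300 and the model training apparatus 200 are combined into one hardware apparatus, that hardware apparatus may use the trained fault prediction model for fault locating after training is complete.
As the training process shows, the training set of the fault prediction model consists of attribute values of the hard disk 410 during normal operation. Building the training set is therefore simple, because attribute values from a hard disk 410 that is faulty or has a fault risk need not be considered. This also makes training simpler and more efficient, and the trained fault prediction model more accurate.
In this way, the fault prediction model is configured on the fault locating apparatus 300, which can use it to perform fault locating and determine whether the hard disk has a fault risk. The following describes the fault prediction method for the hard disk 410 provided in the embodiments with reference to FIG. 3. The method includes:
Step 301: The data collection apparatus 100 obtains a first attribute value of the hard disk 410. The first attribute value may be an attribute value of the currently running hard disk 410 collected by the data collection apparatus 100. It may be an attribute value from normal operation of the hard disk 410 or from a time when the hard disk 410 has a fault risk. The first attribute value may represent the current operating state of the hard disk 410, in which case the fault locating apparatus 300 performs online (real-time) fault locating. The first attribute value may also represent the operating state of the hard disk 410 at some time in the past, in which case the fault locating apparatus 300 performs offline fault locating.
Step 302: The data collection apparatus 100 sends the first attribute value to the fault locating apparatus 300. Before sending it, the data collection apparatus 100 may preprocess the first attribute value. For the preprocessing manners, refer to the foregoing description; details are not repeated here.
Step 303: After receiving the first attribute value, the fault locating apparatus 300 inputs it into the fault prediction model and obtains the corresponding output value. For ease of description, this output value is referred to as the first abnormal score of the first attribute value. The first abnormal score characterizes the operating state of the hard disk 410 under the first attribute value.
The first attribute value received by the fault locating apparatus 300 may already be preprocessed, in which case the fault locating apparatus may input it directly into the fault prediction model. The received first attribute value may also be unpreprocessed. In that case, the fault locating apparatus may input it directly into the fault prediction model (corresponding to the scenario in which the second attribute values in the training set were not preprocessed when the model was trained), or may first preprocess it and input the preprocessed first attribute value into the fault prediction model (corresponding to the scenario in which the second attribute values in the training set were preprocessed). For the preprocessing, refer to the foregoing description; details are not repeated here.
Step 304: The fault locating apparatus 300 compares the first abnormal score with a threshold. The threshold is determined according to the second abnormal scores. It may be not less than the maximum of the second abnormal scores obtained in the embodiment shown in FIG. 2, or it may be a quantile of the multiple second abnormal scores.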
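As a rough sketch of the two threshold options in step 304, the helper below derives a threshold from second abnormal scores either as their maximum plus an optional margin or as a quantile. The names `threshold_from_scores` and `has_fault_risk`, the `margin` parameter, and the nearest-rank quantile definition are assumptions for illustration; the application does not state which quantile definition is used.

```python
import math


def threshold_from_scores(second_scores, quantile=None, margin=0.0):
    """Derive a threshold from abnormal scores of normally operating disks.

    quantile=None  -> maximum score plus `margin` (not less than the maximum).
    quantile=q     -> nearest-rank q-quantile of the scores, with 0 < q <= 1.
    """
    if not second_scores:
        raise ValueError("at least one second abnormal score is required")
    if quantile is None:
        return max(second_scores) + margin
    if not 0 < quantile <= 1:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(second_scores)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]


def has_fault_risk(first_score, threshold):
    """Steps 304 and 305: a fault risk exists only if the score exceeds the threshold."""
    return first_score > threshold


# Hypothetical second abnormal scores from the training set.
second_scores = [0.10, 0.12, 0.08, 0.30, 0.11]
max_threshold = threshold_from_scores(second_scores)
median_threshold = threshold_from_scores(second_scores, quantile=0.5)
```

A quantile below 1 tolerates a few unusually high scores among the normal samples, at the cost of flagging more disks; the maximum-based threshold is the more conservative choice.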
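Step 305: When determining that the first abnormal score is greater than the threshold, the fault locating apparatus 300 determines that the hard disk 410 has a fault risk. Further, the fault locating apparatus 300 may determine the fault cause of the hard disk 410 according to the first abnormal score.
When the first abnormal score is not greater than the threshold, the hard disk 410 operates normally. When the first abnormal score is greater than the threshold, the hard disk 410 has a fault risk, and the fault cause may be further determined.
As the description of the second abnormal score for FIG. 2 shows, the first abnormal score is a weighted sum of multiple state values. The larger a weight, the more likely it is to push the first abnormal score above the threshold. The fault locating apparatus 300 may therefore determine, according to the weights, some or all of the state values with larger weights. The components indicated by these state values may have a fault risk, which causes the hard disk 410 to have a fault risk; that is, the fault cause of the hard disk 410 is that the components indicated by these state values have a fault risk.
Taking a first attribute value that includes K state values as an example, the fault locating apparatus 300 may determine the state value with the largest weight, or the state values ranked in the top N after the weights are sorted in descending order (where K > N, and K and N are positive integers). The determined state values are referred to as target state values. N may be an empirical value. The fault locating apparatus 300 can then determine that the components described by the target state values have a fault risk and are the fault cause of the hard disk 410.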
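The top-N selection described above can be sketched as a sort over the weights. The function name `target_state_values` and the state-value names in the example are hypothetical; how the weights are obtained from the model is outside this sketch.

```python
def target_state_values(weights, n=1):
    """Return the names of the n state values with the largest weights.

    `weights` maps a state-value name to its weight in the abnormal score.
    The source requires K > N, where K is the number of state values.
    """
    if not 0 < n < len(weights):
        raise ValueError("n must satisfy 0 < n < number of state values")
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical weights for K = 3 state values.
weights = {
    "uncorrectable_errors": 0.6,
    "program_errors": 0.1,
    "new_bad_blocks": 0.3,
}
top_one = target_state_values(weights)
top_two = target_state_values(weights, n=2)
```

The components described by the returned state values would then be reported as the likely fault cause.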
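As the embodiment shown in FIG. 3 shows, the fault locating apparatus 300 relies on a pre-trained fault prediction model when determining, according to the first attribute value, whether the hard disk 410 is faulty. Because the fault prediction model is trained on attribute values of the hard disk 410 during normal operation, the fault locating apparatus 300 can accurately determine during fault prediction whether the hard disk 410 has a fault risk by comparing the output value of the model with the threshold.
In the embodiments of this application, the fault prediction model may be updated. For example, the model training apparatus 200 may obtain new second attribute values through the data collection apparatus 100, continue to train the fault prediction model, and, after training is complete, update the fault prediction model on the fault locating apparatus 300. The update may be performed online; that is, the model training apparatus 200 may obtain new second attribute values while the fault locating apparatus 300 is running, continue training, and then update the model on the fault locating apparatus 300. This maintains the accuracy of the fault prediction model, so that the fault locating apparatus 300 can subsequently use the updated model to accurately determine whether the hard disk 410 has a fault risk and to locate faults accurately.
FIG. 4 is a schematic diagram of the fault prediction method for the hard disk 410 according to an embodiment of this application. In FIG. 4, the data collection apparatus 100 collects SMART attribute values of an SSD, preprocesses them, and sends them to the model training apparatus 200. The model training apparatus 200 trains a VAE-LSTM model and determines the threshold. After training is complete, the model training apparatus 200 configures the VAE-LSTM model and the threshold on the fault locating apparatus 300. The fault locating apparatus 300 obtains preprocessed SMART attribute values of the SSD from the data collection apparatus 100, uses the VAE-LSTM model to output an abnormal score, and determines, by comparing the abnormal score with the threshold, whether the SSD operates normally or has a fault risk. If a fault risk exists, the fault locating apparatus 300 determines the fault cause of the hard disk 410 according to the abnormal score, thereby locating the fault. The model training apparatus 200 may also update the fault prediction model and configure the updated model on the fault locating apparatus 300.
It should be noted that, in the embodiments of this application, step 305 (that is, determining the fault cause of the hard disk 410 according to the first abnormal score) may also be implemented by a model, referred to for ease of description as a fault analysis model. When the first abnormal score is greater than the threshold, the fault analysis model may obtain the weight of each state value in the first abnormal score, determine the target state value, and then determine the faulty component. The fault analysis model may be merged with the fault prediction model; the merged model is referred to here as a fault locating model. After training the fault prediction model, the model training apparatus 200 may merge the fault analysis model into the fault prediction model to form the fault locating model and configure it on the fault locating apparatus 300. The fault locating apparatus 300 can then use the fault locating model to determine the fault cause of the hard disk 410.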
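Based on the same inventive concept as the method embodiments, an embodiment of this application further provides a fault locating apparatus, configured to perform the method performed by the fault locating apparatus in the foregoing method embodiments. For related features, refer to the foregoing method embodiments; details are not repeated here. As shown in FIG. 5, the fault locating apparatus 500 includes an obtaining unit 501, a score determination unit 502, and a risk determination unit 503.
The obtaining unit 501 is configured to obtain a first attribute value of a hard disk, where the first attribute value is used to indicate the operating states of multiple components in the hard disk.
The score determination unit 502 is configured to input the first attribute value into a fault prediction model and output a first abnormal score. The first abnormal score is used to indicate the operating state of the hard disk under the first attribute value. The fault prediction model is generated through training based on second attribute values of the hard disk, and a second attribute value is used to indicate the operating states of the multiple components when the hard disk operates normally. In the embodiments of this application, normal operation of the hard disk means that no component in the hard disk has a fault risk, or that, within an allowable range, the components in the hard disk are considered to be able to run normally without causing a fault risk for the hard disk.
The risk determination unit 503 is configured to determine that the hard disk has a fault risk when the first abnormal score is greater than a threshold.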
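The division into an obtaining unit, a score determination unit, and a risk determination unit can be pictured with the following minimal sketch. It is only an illustration of the logical split: the class name `FaultLocator`, its method names, and the toy model (a plain sum standing in for a trained fault prediction model) are all assumptions, not part of the application.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class FaultLocator:
    """Logical counterpart of units 501 to 503 in one object."""
    source: Callable[[], Sequence[float]]          # supplies first attribute values
    model: Callable[[Sequence[float]], float]      # stand-in for the trained model
    threshold: float

    def obtain(self) -> Sequence[float]:
        """Obtaining unit: fetch a first attribute value."""
        return self.source()

    def determine_score(self, first_attribute_value: Sequence[float]) -> float:
        """Score determination unit: compute the first abnormal score."""
        return self.model(first_attribute_value)

    def has_fault_risk(self, first_abnormal_score: float) -> bool:
        """Risk determination unit: compare the score with the threshold."""
        return first_abnormal_score > self.threshold


# Toy setup: the "model" just sums the state values.
locator = FaultLocator(source=lambda: [0.9, 0.4], model=sum, threshold=1.0)
score = locator.determine_score(locator.obtain())
at_risk = locator.has_fault_risk(score)
```

Keeping the three responsibilities as separate methods mirrors the statement that the unit division is logical only; they could equally be merged or distributed across devices.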
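In a possible implementation, when the fault prediction model is trained, a training set may be built from the second attribute values, and the fault prediction model is trained on the training set in an unsupervised learning manner.
In a possible implementation, the output value obtained by inputting a second attribute value into the fault prediction model may be referred to as a second abnormal score. The threshold compared with the first abnormal score may be determined according to multiple second abnormal scores. For example, the threshold may be set to be not less than the maximum of the multiple second abnormal scores, or may be set equal to a quantile of the multiple second abnormal scores.
In a possible implementation, this application does not limit the structure of the fault prediction model. For example, the fault prediction model may include a VAE and an LSTM.
In a possible implementation, the first attribute value includes multiple state values, and one state value is used to indicate the operating state of one component in the hard disk.
In a possible implementation, the first abnormal score is equal to the sum of the products of each state value in the first attribute value and its corresponding weight. The risk determination unit 503 may first determine, according to the first abnormal score, the state value with the largest corresponding weight among the multiple state values, referred to for ease of description as the target state value, and then determine the fault cause according to the target state value. The fault cause indicates that the component indicated by the target state value is faulty.
In a possible implementation, when inputting the first attribute value into the fault prediction model to output the first abnormal score, the score determination unit 502 may first preprocess the first attribute value and then input the preprocessed first attribute value into the fault prediction model to output the first abnormal score. The preprocessing includes some or all of the following: screening and normalization.
In a possible implementation, the risk determination unit 503 indicates that the hard disk operates normally when the first abnormal score is not greater than the threshold.
The division into units in the embodiments of this application is schematic and is merely a logical function division. Other division manners are possible in actual implementation. In addition, the functional units in the embodiments of this application may be integrated into one processor, may exist physically alone, or two or more units may be integrated into one module. The integrated units may be implemented in the form of hardware or in the form of software functional modules.
If the integrated unit is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a terminal device (which may be a personal computer, a mobile phone, a network device, or the like) or a processor to perform all or some of the steps of the methods in the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.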
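In a simple embodiment, a person skilled in the art can conceive that the fault locating apparatus 300 in the embodiment shown in FIG. 2 or FIG. 3 may take the form shown in FIG. 6.
The computing device 600 shown in FIG. 6 includes at least one processor 601 and a memory 602, and optionally may further include a communication interface 603.
The memory 602 may be a volatile memory, such as a random access memory. The memory may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or a physical disk. Alternatively, the memory 602 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 602 may be a combination of the foregoing memories.
The embodiments of this application do not limit the specific connection medium between the processor 601 and the memory 602.
The processor 601 may be a central processing unit (CPU). The processor 601 may also be another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, an artificial intelligence chip, a chip on a chip, or the like. A general-purpose processor may be a microprocessor or any conventional processor. When communicating with other devices, the processor 601 may transmit data through the communication interface 603, for example, to receive a first detection instruction or a second detection instruction.
When the fault locating apparatus takes the form shown in FIG. 6, the processor 601 in FIG. 6 may invoke the computer-executable instructions stored in the memory 602, so that the computing device performs the method performed by the fault locating apparatus 300 in any of the foregoing method embodiments.
Specifically, the functions and implementation processes of the obtaining unit, the score determination unit, and the risk determination unit in FIG. 5 may all be implemented by the processor 601 in FIG. 6 invoking the computer-executable instructions stored in the memory 602. Alternatively, the functions and implementation processes of the score determination unit and the risk determination unit in FIG. 5 may be implemented by the processor 601 in FIG. 6 invoking the computer-executable instructions stored in the memory 602, and the functions and implementation process of the obtaining unit in FIG. 5 may be implemented by the communication interface 603 in FIG. 6.
A person skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Clearly, a person skilled in the art can make various changes and variations to this application without departing from its scope. If these changes and variations fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to cover them.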

Claims (18)

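  1. A fault prediction method for a hard disk, wherein the method comprises:
    obtaining a first attribute value of a hard disk, wherein the first attribute value is used to indicate operating states of multiple components in the hard disk;
    inputting the first attribute value into a fault prediction model and outputting a first abnormal score, wherein the first abnormal score is used to indicate an operating state of the hard disk under the first attribute value, the fault prediction model is generated through training based on a second attribute value of the hard disk, and the second attribute value is used to indicate operating states of the multiple components when the hard disk operates normally; and
    when the first abnormal score is greater than a threshold, determining that the hard disk has a fault risk.
  2. The method according to claim 1, wherein the fault prediction model is generated through training in an unsupervised learning manner based on a training set that includes the second attribute value.
  3. The method according to claim 2, wherein the threshold is determined according to multiple second abnormal scores, and a second abnormal score is an output value obtained by inputting the second attribute value into the fault prediction model.
  4. The method according to any one of claims 1 to 3, wherein the fault prediction model comprises a variational autoencoder (VAE) and a long short-term memory (LSTM) network.
  5. The method according to any one of claims 1 to 4, wherein the first attribute value comprises multiple state values, and one state value is used to indicate an operating state of one component in the hard disk.
  6. The method according to claim 5, wherein the first abnormal score is equal to a sum of products of each state value in the first attribute value and a corresponding weight, and the method further comprises:
    determining, according to the first abnormal score, a target state value with the largest corresponding weight from the multiple state values; and
    determining the fault cause according to the target state value, wherein the fault cause indicates that the component indicated by the target state value has a fault risk.
  7. The method according to any one of claims 1 to 6, wherein the inputting the first attribute value into a fault prediction model and outputting a first abnormal score comprises:
    preprocessing the first attribute value, inputting the preprocessed first attribute value into the fault prediction model, and outputting the first abnormal score, wherein the preprocessing comprises some or all of the following: screening and normalization.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises:
    when the first abnormal score is not greater than the threshold, indicating that the hard disk operates normally.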
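  9. A fault locating apparatus, wherein the apparatus comprises an obtaining unit, a score determining unit, and a risk determining unit;
    the obtaining unit is configured to obtain a first attribute value of a hard disk, wherein the first attribute value indicates running states of a plurality of components in the hard disk;
    the score determining unit is configured to input the first attribute value into a fault prediction model and output a first anomaly score, wherein the first anomaly score indicates a running state of the hard disk under the first attribute value, the fault prediction model is generated through training based on a second attribute value of the hard disk, and the second attribute value indicates running states of the plurality of components when the hard disk runs normally; and
    the risk determining unit is configured to determine, when the first anomaly score is greater than a threshold, that the hard disk is at risk of a fault.
  10. The apparatus according to claim 9, wherein the fault prediction model is generated through unsupervised learning based on a training set that includes the second attribute value.
  11. The apparatus according to claim 10, wherein the threshold is determined based on a plurality of second anomaly scores, and each second anomaly score is an output value obtained by inputting the second attribute value into the fault prediction model; and the threshold is not less than the maximum of the plurality of second anomaly scores.
  12. The apparatus according to any one of claims 9 to 11, wherein the fault prediction model comprises a variational autoencoder (VAE) and a long short-term memory (LSTM) network.
  13. The apparatus according to any one of claims 9 to 12, wherein the first attribute value comprises a plurality of state values, and each state value indicates a running state of one component in the hard disk.
  14. The apparatus according to claim 13, wherein the first anomaly score is equal to the sum of the products of each state value in the first attribute value and its corresponding weight, and the risk determining unit is further configured to:
    determine, based on the first anomaly score, a target state value with the largest corresponding weight among the plurality of state values; and
    determine a fault cause based on the target state value, wherein the fault cause indicates that the component indicated by the target state value is at risk of a fault.
  15. The apparatus according to any one of claims 9 to 14, wherein when inputting the first attribute value into the fault prediction model and outputting the first anomaly score, the score determining unit is specifically configured to:
    preprocess the first attribute value, input the preprocessed first attribute value into the fault prediction model, and output the first anomaly score, wherein the preprocessing comprises some or all of the following: filtering and normalization.
  16. The apparatus according to any one of claims 9 to 15, wherein the risk determining unit is further configured to:
    when the first anomaly score is not greater than the threshold, indicate that the hard disk is running normally.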
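  17. A computing device, wherein the computing device comprises a processor and a memory;
    the memory is configured to store computer program instructions; and
    the processor is configured to invoke the computer program instructions in the memory to perform the method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to perform the method according to any one of claims 1 to 8.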
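
The flow recited in claims 1, 6, 7, and 11 (normalize, score, compare against a threshold derived from healthy data, then attribute the risk to the component with the largest weight) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the applicant's implementation: `StandInModel` is a hypothetical stand-in that weights each component by its deviation from the healthy mean, whereas the claims recite a model comprising a VAE and an LSTM. The component names and attribute values are made up, and all function names are hypothetical.

```python
def fit_min_max(healthy):
    """Per-component (min, max) over the healthy training samples."""
    return [(min(col), max(col)) for col in zip(*healthy)]


def normalize(sample, bounds):
    """Min-max normalization (claim 7); constant components map to 0.0."""
    return [(x - lo) / (hi - lo) if hi > lo else 0.0
            for x, (lo, hi) in zip(sample, bounds)]


class StandInModel:
    """Hypothetical stand-in for the trained model (NOT a VAE/LSTM)."""

    def __init__(self, healthy_normalized):
        n = len(healthy_normalized)
        self.mean = [sum(col) / n for col in zip(*healthy_normalized)]

    def weights(self, sample):
        # Assumption: a component's weight is its absolute deviation
        # from the mean of the healthy (second) attribute values.
        return [abs(x - m) for x, m in zip(sample, self.mean)]

    def score(self, sample):
        # Claim 6: the score is the sum of state value * weight.
        return sum(x * w for x, w in zip(sample, self.weights(sample)))


def predict(sample, names, bounds, model, threshold):
    """Score one sample; name the max-weight component if at risk."""
    x = normalize(sample, bounds)
    score = model.score(x)
    result = {"score": score, "at_risk": score > threshold, "cause": None}
    if result["at_risk"]:
        w = model.weights(x)
        result["cause"] = names[max(range(len(w)), key=w.__getitem__)]
    return result


# Made-up state values for three hypothetical components.
names = ["head", "platter", "motor"]
healthy = [[10.0, 5.0, 100.0], [12.0, 6.0, 102.0],
           [11.0, 5.5, 101.0], [10.5, 5.2, 100.5]]

bounds = fit_min_max(healthy)
healthy_norm = [normalize(s, bounds) for s in healthy]
model = StandInModel(healthy_norm)

# Claim 11: the threshold is not less than the largest score on healthy data.
threshold = max(model.score(s) for s in healthy_norm)

ok = predict([11.0, 5.5, 101.0], names, bounds, model, threshold)
bad = predict([11.0, 5.5, 130.0], names, bounds, model, threshold)
```

In this sketch `ok` is not flagged, while `bad` is flagged and attributed to the hypothetical `motor` component, because that state value carries the largest weight. Setting the threshold to the maximum healthy score means no training sample is itself flagged; a real deployment might add a margin above that maximum.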
PCT/CN2021/142559 2021-02-08 2021-12-29 Fault prediction method, apparatus, and device for hard disk WO2022166481A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110172914.6A CN114943321A (zh) 2021-02-08 2021-02-08 Fault prediction method, apparatus, and device for hard disk
CN202110172914.6 2021-02-08

Publications (1)

Publication Number Publication Date
WO2022166481A1 (zh) 2022-08-11

Family

ID=82740814

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/142559 WO2022166481A1 (zh) 2021-02-08 2021-12-29 Fault prediction method, apparatus, and device for hard disk

Country Status (2)

Country Link
CN (1) CN114943321A (zh)
WO (1) WO2022166481A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
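CN115729761A (zh) * 2022-11-23 2023-03-03 中国人民解放军陆军装甲兵学院 Hard disk fault prediction method, system, device, and medium
CN116069590A (zh) * 2023-01-09 2023-05-05 黑龙江愚公软件科技有限公司 Artificial-intelligence-based computer data security monitoring system and method
CN117290147A (zh) * 2023-10-13 2023-12-26 鑫硕泰(深圳)科技有限公司 Solid-state drive fault repair method, apparatus, and electronic device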

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
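CN115171768B (zh) * 2022-09-05 2022-12-02 北京得瑞领新科技有限公司 Method, apparatus, storage medium, and device for improving the efficiency of analyzing defective SSD products
CN116610469B (zh) * 2023-07-21 2023-11-14 江苏华存电子科技有限公司 Comprehensive quality and performance testing method and system for solid-state drives
CN117093433B (zh) * 2023-10-18 2024-02-09 苏州元脑智能科技有限公司 Fault detection method, apparatus, electronic device, and storage medium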

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
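CN111767162A (zh) * 2020-05-20 2020-10-13 北京大学 Fault prediction method and electronic apparatus for hard disks of different models
CN111782491A (zh) * 2019-11-15 2020-10-16 华中科技大学 Disk fault prediction method, apparatus, device, and storage medium
CN111881000A (zh) * 2020-08-07 2020-11-03 广州云从博衍智能科技有限公司 Fault prediction method, apparatus, device, and machine-readable medium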
US20200380336A1 (en) * 2019-06-03 2020-12-03 Dell Products L.P. Real-Time Predictive Maintenance of Hardware Components Using a Stacked Deep Learning Architecture on Time-Variant Parameters Combined with a Dense Neural Network Supplied with Exogeneous Static Outputs

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200380336A1 (en) * 2019-06-03 2020-12-03 Dell Products L.P. Real-Time Predictive Maintenance of Hardware Components Using a Stacked Deep Learning Architecture on Time-Variant Parameters Combined with a Dense Neural Network Supplied with Exogeneous Static Outputs
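CN111782491A (zh) * 2019-11-15 2020-10-16 华中科技大学 Disk fault prediction method, apparatus, device, and storage medium
CN111767162A (zh) * 2020-05-20 2020-10-13 北京大学 Fault prediction method and electronic apparatus for hard disks of different models
CN111881000A (zh) * 2020-08-07 2020-11-03 广州云从博衍智能科技有限公司 Fault prediction method, apparatus, device, and machine-readable medium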

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
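CN115729761A (zh) * 2022-11-23 2023-03-03 中国人民解放军陆军装甲兵学院 Hard disk fault prediction method, system, device, and medium
CN115729761B (zh) * 2022-11-23 2023-10-20 中国人民解放军陆军装甲兵学院 Hard disk fault prediction method, system, device, and medium
CN116069590A (zh) * 2023-01-09 2023-05-05 黑龙江愚公软件科技有限公司 Artificial-intelligence-based computer data security monitoring system and method
CN116069590B (zh) * 2023-01-09 2023-10-24 甘肃昊润科技信息有限公司 Artificial-intelligence-based computer data security monitoring system and method
CN117290147A (zh) * 2023-10-13 2023-12-26 鑫硕泰(深圳)科技有限公司 Solid-state drive fault repair method, apparatus, and electronic device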

Also Published As

Publication number Publication date
CN114943321A (zh) 2022-08-26

Similar Documents

Publication Publication Date Title
WO2022166481A1 (zh) Fault prediction method, apparatus, and device for hard disk
US10805151B2 (en) Method, apparatus, and storage medium for diagnosing failure based on a service monitoring indicator of a server by clustering servers with similar degrees of abnormal fluctuation
US11392826B2 (en) Neural network-assisted computer network management
US10147048B2 (en) Storage device lifetime monitoring system and storage device lifetime monitoring method thereof
US10157105B2 (en) Method for data protection for cloud-based service system
US20160299938A1 (en) Anomaly detection system and method
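CN108052528A (zh) Time-series classification early-warning method for storage devices
WO2021213247A1 (zh) Anomaly detection method and apparatus
CN110164501B (zh) Hard disk detection method, apparatus, storage medium, and device
CN112433896B (zh) Server disk fault prediction method, apparatus, device, and storage medium
CN111242323A (zh) Proactive automated system and method for repairing sub-optimal operation of a machine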
US11966214B2 (en) Industrial internet of things systems for intelligent repair of manufacturing equipment and control methods thereof
US20190355434A1 (en) Memory system quality integral analysis and configuration
WO2023169274A1 (zh) Data processing method, apparatus, storage medium, and processor
WO2021126398A1 (en) Behavior-driven die management on solid-state drives
US20220368614A1 (en) Method and system for anomaly detection based on time series
US11841295B2 (en) Asset agnostic anomaly detection using clustering and auto encoder
US9645873B2 (en) Integrated configuration management and monitoring for computer systems
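CN117251114A (zh) Model training method, disk lifetime prediction method, and related apparatus and device
KR20220068799A (ko) System and method for detecting failure of automated equipment
CN117194145A (zh) Abnormal client detection method, apparatus, electronic device, and storage medium
CN116991615A (zh) Online-learning-based fault self-healing method and apparatus for cloud-native systems
CN116820821A (zh) Disk fault detection method, apparatus, electronic device, and computer-readable storage medium
KR102420514B1 (ko) Method and analysis apparatus for detecting hardware and software defects using deep learning
CN109491844B (zh) Computer system for identifying abnormal information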

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21924474

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21924474

Country of ref document: EP

Kind code of ref document: A1