CN112214369A - Hard disk fault prediction model establishing method based on model fusion and application thereof - Google Patents


Publication number
CN112214369A
Authority
CN
China
Prior art keywords
model
hard disk
feature
features
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011147445.4A
Other languages
Chinese (zh)
Inventor
陈俭喜
冯丹
陈彧
陈鑫宇
马莉珍
郑梦丽
董深育
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202011147445.4A
Publication of CN112214369A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/22: Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26: Functional testing
    • G06F11/261: Functional testing by simulating additional hardware, e.g. fault simulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method for establishing a hard disk failure prediction model based on model fusion, and an application thereof, belonging to the technical field of computer storage. The method comprises the following steps: extracting basic features from historical SMART data of hard disks, constructing new features, and then screening out an optimal feature subset; constructing the features corresponding to each piece of historical data according to the screening result, combining them with the corresponding hard disk state to form samples and obtain a training data set, and then dividing the training data set into a training set and a test set; establishing a plurality of different machine learning models to obtain a plurality of base models; executing a sub-model establishing step multiple times for each base model to obtain a plurality of sub-models, integrating the sub-models into a hard disk failure prediction model, and performing parameter tuning and evaluation on the hard disk failure prediction model using the test set. The sub-model establishing step is as follows: randomly selecting some of the features from the optimal feature subset and training a single base model with the training set to obtain a sub-model, with only the selected features used as input during training. The invention can improve the accuracy of hard disk failure prediction.

Description

Hard disk fault prediction model establishing method based on model fusion and application thereof
Technical Field
The invention belongs to the technical field of computer storage, and particularly relates to a hard disk fault prediction model building method based on model fusion and application thereof.
Background
With the development of technologies such as big data and cloud computing in recent years, many internet enterprises such as Microsoft, Google, and Alibaba have established large data centers to provide cloud services for users. However, as the number of users grows and the storage scale increases sharply, various failures occur in data centers. Hard disks are the main devices for storing data and, compared with other devices, are large in number and short in service life, so hard disk failures can greatly affect the reliability of the data center and the user experience. Therefore, early prediction of hard disk failures is of great value to enterprises.
Existing methods for improving data center reliability fall into two categories: active fault tolerance and passive fault tolerance. Passive fault tolerance mainly includes technologies such as erasure coding and backup, which are costly, and the cost grows as the amount of data increases, so their use is limited to a certain extent. Compared with passive fault tolerance, active fault tolerance has an obvious cost advantage. One commonly used active fault tolerance method is to periodically monitor the SMART (Self-Monitoring, Analysis and Reporting Technology) attributes of the hard disk; when an attribute reaches a specified threshold, the hard disk may be about to fail. Another common active fault tolerance method combines machine learning: it treats hard disk failure prediction as a binary classification problem, builds a model from historical SMART data, and applies the model to newly collected data to determine whether the hard disk will fail.
Among active fault tolerance techniques, judging hard disk failures solely by monitoring SMART attribute thresholds is too simplistic and has low accuracy. Traditional machine learning methods generally use a single model and only the original SMART attributes of the hard disk, do not fully mine the relationships between attributes, and therefore cannot be applied well in actual production environments. In general, the accuracy of existing active fault tolerance techniques for hard disk failure prediction needs to be further improved.
Disclosure of Invention
Aiming at the defects and improvement requirements of the prior art, the invention provides a hard disk failure prediction model establishing method based on model fusion and application thereof, and aims to improve the accuracy of hard disk failure prediction.
In order to achieve the above object, according to an aspect of the present invention, there is provided a method for establishing a hard disk failure prediction model based on model fusion, including:
a feature engineering step: extracting basic features from historical data of hard disk SMART information periodically collected from a data center, constructing new features through feature engineering, and selecting from all the features the partial features that make the hard disk failure prediction accuracy highest, to obtain an optimal feature subset;
a data set construction step: according to the optimal feature subset, constructing the features corresponding to each piece of data in the historical data, wherein the features of each piece of data and the corresponding hard disk state form a sample, and all the samples form a training data set; dividing the training data set into a training set and a test set;
a base model establishing step: establishing a plurality of different machine learning models, each of which performs failure prediction according to the feature data of the hard disk, wherein each machine learning model serves as a base model;
a sub-model establishing step: for a single base model, randomly selecting partial features from the optimal feature subset according to a specified proportion, and training the base model using the training set to obtain a sub-model; in the training process, only the selected features in each sample are taken as input;
a model fusion step: executing the sub-model establishing step multiple times for each base model to obtain a plurality of sub-models, integrating all the sub-models into a hard disk failure prediction model, fusing the prediction results output by all the sub-models as the prediction result of the hard disk failure prediction model, and performing parameter tuning and evaluation on the hard disk failure prediction model using the test set.
The method extracts basic features from the SMART information and, on this basis, constructs new features through feature engineering, so that the relationships within and between the original SMART features can be mined and meaningful new features introduced; on the basis of all the features, an optimal feature subset is further screened out, so that the relationship between the hard disk state information and failures can be found while reducing dimensionality; a plurality of independent sub-models with strong prediction capability are generated based on different machine learning models, and the outputs of the sub-models are then fused, which enhances the generalization capability of the final model across different data sets and effectively improves the hard disk failure detection rate. Therefore, the invention can effectively improve the accuracy of hard disk failure detection.
Further, in the sub-model establishing step, training the base model by using a training set, including:
randomly undersampling the good disk samples in the training set, forming a new training set from the sampled good disk samples and all the bad disk samples in the training set, and training the base model using the samples in the new training set;
wherein randomly undersampling the good disk samples in the training set comprises:
randomly sampling the good disks involved in the training set according to a preset sampling proportion;
and for each sampled good disk, randomly selecting one sample from the samples of that disk, all the randomly selected good disk samples forming the random undersampling result.
Because the hard disk data set is imbalanced between positive and negative samples, with good disk samples far outnumbering bad disk samples, the invention undersamples the good disk samples and uses the sampled good disk samples together with all the bad disk samples as the data set for training the base model, thereby effectively reducing the influence of sample skew on the prediction result. In the undersampling process, the good disks are sampled first, and then one sample is randomly selected from each sampled good disk, so that the good disk samples are distributed as widely as possible, which further ensures the prediction accuracy and robustness of the model.
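The two-stage undersampling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the sample layout `(disk_id, features, label)`, the function name `undersample_good_disks`, and the convention that label 0 marks a good disk sample are assumptions made for the example.

```python
import random
from collections import defaultdict


def undersample_good_disks(samples, disk_ratio, seed=0):
    """Two-stage random undersampling of good disk samples.

    samples: list of (disk_id, features, label); label 0 = good, 1 = bad
             (this layout is an assumption for illustration).
    disk_ratio: preset proportion of good disks to sample.
    """
    rng = random.Random(seed)

    # Keep every bad disk sample.
    bad = [s for s in samples if s[2] == 1]

    # Group the good disk samples by disk.
    good_by_disk = defaultdict(list)
    for s in samples:
        if s[2] == 0:
            good_by_disk[s[0]].append(s)

    # Stage 1: randomly sample good disks according to the preset proportion.
    disk_ids = sorted(good_by_disk)
    n_disks = int(len(disk_ids) * disk_ratio)
    chosen = rng.sample(disk_ids, n_disks)

    # Stage 2: randomly pick exactly one sample from each chosen disk.
    good = [rng.choice(good_by_disk[d]) for d in chosen]
    return good + bad
```

Because at most one sample is taken per good disk, the retained good samples come from as many distinct disks as the sampling proportion allows.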
Further, the random seed of the base model is different each time the sub-model building step is performed, thereby ensuring the difference between all sub-models.
Further, in the base model establishing step, the established machine learning models are respectively: CatBoost, XGBoost and LightGBM.
When performing failure prediction on hard disks, CatBoost, XGBoost and LightGBM train faster and predict more accurately than other machine learning models, so the method takes CatBoost, XGBoost and LightGBM as base models and can obtain a better prediction effect.
Further, in the feature engineering step, constructing a new feature through feature engineering, including:
for each basic feature, calculating one or more statistical features of the basic feature, and taking each statistical feature as a new feature;
wherein the statistical features include a maximum value, and/or a mean value.
Many indicators of a hard disk, such as the amount of data written and the number of check errors, change over time and gradually increase. In many cases the maximum value and the mean value of the same indicator differ greatly between good disks and bad disks; for example, the maximum number of check errors of a bad disk may be many times that of a good disk. Such statistical features therefore have a large influence on the prediction result of the model. Through feature engineering, the method uses statistical information such as the maximum value and the mean value of the basic features as new features, which improves the prediction accuracy of the model.
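As a rough illustration of the statistical features, the sketch below derives a maximum and a mean for each basic feature of one disk. The text does not state the window over which the statistics are computed; using the disk's whole recorded history, the `_max` and `_mean` suffixes, and the function name are assumptions of this example.

```python
from statistics import mean


def add_statistical_features(history):
    """Derive max and mean features for one disk.

    history: dict mapping a basic feature name to the list of its recorded
             values for this disk, oldest first (assumed layout).
    """
    features = {}
    for name, values in history.items():
        features[name] = values[-1]              # latest value of the basic feature
        features[name + "_max"] = max(values)    # statistical feature: maximum
        features[name + "_mean"] = mean(values)  # statistical feature: mean
    return features
```

For a history such as `{"check_errors": [0, 2, 4]}`, this yields the latest value 4, a maximum of 4 and a mean of 2.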
Further, in the feature engineering step, constructing a new feature through feature engineering, including:
for each basic feature, dividing the raw value by the normalized value to obtain a new feature.
For each feature of the hard disk, the manufacturer provides a raw value and a normalized value. The normalized value is the result of some normalization operation on the raw value, which means there should be a potential relationship between the two that can be mined; since the normalized values all lie within the interval [0,1], the raw value can be divided by the normalized value to generate a new feature. By taking the raw value of a basic feature divided by its normalized value as a new feature, the method can further mine the relationships among the features and improve the prediction accuracy of the model.
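A minimal sketch of the ratio feature follows. The record layout, the `_ratio` suffix and the function name are hypothetical; returning `None` (a missing value) when the normalized value is zero is also an assumption, since the text does not say how that case is handled.

```python
def build_ratio_features(record):
    """Build raw/normalized ratio features for one SMART record.

    record: dict mapping a basic feature name to a (raw, normalized) pair
            (assumed layout).
    """
    ratios = {}
    for name, (raw, normalized) in record.items():
        if normalized == 0:
            # Assumption: an undefined ratio is treated as a missing value.
            ratios[name + "_ratio"] = None
        else:
            ratios[name + "_ratio"] = raw / normalized
    return ratios
```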
Further, in the feature engineering step, a wrapper method is used to select, from all the features, the partial features that make the hard disk failure prediction accuracy highest, and when the wrapper method is executed, the selection model used is one of the established base models.
The invention screens out the optimal feature subset from all the features using a wrapper method, so that the relationship between the hard disk state information and failures can be found while reducing dimensionality, which further improves the prediction accuracy of the model.
Further, in the step of feature engineering, before constructing a new feature by feature engineering, the method further includes:
after data cleaning is performed on the historical data, removing, as basically unchanged features, the features for which the difference between the maximum value and the minimum value does not exceed a preset threshold.
Because some missing values and outliers inevitably occur in the data set, and their presence prevents the model from being built well, data cleaning removes the outliers and fills the missing values, thereby ensuring the modeling effect. In the invention, the main purpose of feature engineering is to find indicators that readily distinguish good disks from bad disks so as to achieve a good modeling effect; if the value of a feature stays unchanged for a long time in both good disks and bad disks, the feature is of no use for modeling, so removing the features whose values are basically unchanged can effectively reduce the feature dimension and the amount of computation.
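The filter for basically unchanged features can be sketched as below. The column-oriented layout and the function name are assumptions for illustration; the threshold is the preset value mentioned in the text, and its magnitude is not given there.

```python
def drop_near_constant(columns, threshold):
    """Keep only features whose value range exceeds the threshold.

    columns: dict mapping a feature name to the list of its values over the
             cleaned historical data (assumed layout).
    threshold: preset limit on (max - min); features at or below it are
               treated as basically unchanged and removed.
    """
    return {
        name: values
        for name, values in columns.items()
        if max(values) - min(values) > threshold
    }
```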
According to another aspect of the present invention, a hard disk failure prediction method based on model fusion is provided, including:
for real-time data of hard disk SMART information collected from the data center, constructing the features corresponding to the real-time data according to the optimal feature subset obtained by the hard disk failure prediction model establishing method based on model fusion provided by the invention;
inputting the features corresponding to the real-time data into each sub-model of the hard disk failure prediction model obtained by the hard disk failure prediction model establishing method based on model fusion provided by the invention;
and performing soft voting on the prediction results of the sub-models, and taking the soft voting result as the final hard disk failure prediction result.
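Soft voting is commonly implemented by averaging the failure probabilities output by the sub-models and thresholding the average. The sketch below takes that common form; the equal weights, the 0.5 decision threshold and the function name are assumptions, as the text only names soft voting.

```python
def soft_vote(failure_probs, threshold=0.5):
    """Fuse sub-model outputs by soft voting.

    failure_probs: predicted failure probability from each sub-model.
    Returns (label, averaged_probability), where label 1 = bad disk and
    label 0 = good disk.
    """
    avg = sum(failure_probs) / len(failure_probs)
    return (1 if avg >= threshold else 0), avg
```

With probabilities `[0.9, 0.4, 0.4]`, soft voting flags the disk as bad because the average is about 0.57, whereas a majority vote over thresholded labels would call it good; this sensitivity to confident sub-models is the usual argument for soft voting.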
The hard disk failure prediction model established by the model establishing method provided by the invention has high prediction accuracy, so the hard disk failure prediction method based on this model also has high prediction accuracy; in addition, using soft voting to fuse the prediction results output by the sub-models yields better prediction results.
According to yet another aspect of the present invention, there is provided a computer readable storage medium comprising a stored computer program;
when the computer program is executed by the processor, the device where the computer readable storage medium is located is controlled to execute the hard disk failure prediction model building method based on model fusion provided by the invention and/or the hard disk failure prediction method based on model fusion provided by the invention.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The method extracts basic features from the SMART information and, on this basis, constructs new features through feature engineering, so that the relationships within and between the original SMART features can be mined and meaningful new features introduced; on the basis of all the features, an optimal feature subset is further screened out, so that the relationship between the hard disk state information and failures can be found while reducing dimensionality; a plurality of independent sub-models with strong prediction capability are generated based on different machine learning models, and the outputs of the sub-models are then fused, which enhances the generalization capability of the final model across different data sets and effectively improves the hard disk failure detection rate. Therefore, the invention can effectively improve the accuracy of hard disk failure detection.
(2) Through feature engineering, the method uses the statistical information of the SMART features and the result of dividing the raw value by the normalized value as new features, so that the relationships among the features can be fully mined and the accuracy of the model in predicting hard disk failures is further improved.
(3) Before new features are constructed through feature engineering, the data are cleaned and the basically unchanged features are filtered out, which further ensures the accuracy of the model in predicting hard disk failures.
(4) The invention fuses the prediction results output by each sub-model in a soft voting mode, and can obtain better prediction effect.
Drawings
Fig. 1 is a flowchart of a method for establishing a hard disk failure prediction model based on model fusion according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a model fusion method according to an embodiment of the present invention;
FIG. 3 is a flowchart of screening an optimal feature subset according to an embodiment of the present invention;
fig. 4 is a flowchart of a hard disk failure prediction method based on model fusion according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Before explaining the technical scheme of the invention in detail, the SMART information is briefly introduced as follows:
SMART (Self-Monitoring, Analysis and Reporting Technology) is a system and specification for automatically monitoring hard disk status and issuing early warnings. Through detection instructions in the hard disk, it monitors and records the operating states of hardware such as the heads, platters, motor and circuits, and compares them with safety thresholds preset by the manufacturer. If a monitored value is about to exceed its preset safe range, the monitoring hardware or software of the host can automatically warn the user and perform minor automatic repairs, so as to protect the hard disk data in advance. In recent years the SMART attributes have become the main indicators for monitoring hard disk state, and in the following embodiments hard disk failure prediction is likewise performed based on SMART data. The following are examples.
Example 1:
a method for establishing a hard disk failure prediction model based on model fusion is disclosed, as shown in FIG. 1, and comprises the following steps:
a feature engineering step:
extracting basic features from historical data of hard disk SMART information periodically collected from a data center, constructing new features through feature engineering, and selecting from all the features the partial features that make the hard disk failure prediction accuracy highest, to obtain an optimal feature subset;
the time interval for collecting hard disk SMART information from the data center can be determined according to the actual situation of the data center; in this embodiment, the SMART information is collected daily. In this embodiment, the attributes extracted from the SMART information as basic features mainly include: the raw read error rate, the disk start-up time, the reallocated sector count, the reported uncorrectable errors, the uncorrectable sector count, and the like;
a data set construction step:
according to the optimal feature subset, constructing a feature corresponding to each piece of data in the historical data, wherein the feature of each piece of data and the corresponding hard disk state form a sample, and all the samples form a training data set; dividing a training data set into a training set and a test set;
in this embodiment, one piece of data in the historical data is the SMART data of one hard disk for one day; optionally, in this embodiment, the training data set is divided into a training set and a test set at a ratio of 7:3;
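A 7:3 division can be sketched as a seeded shuffle followed by a cut. The text does not say how samples are assigned to the two sets; the random sample-level split, the fixed seed and the function name below are assumptions made for illustration.

```python
import random


def split_train_test(samples, train_fraction=0.7, seed=0):
    """Randomly split samples into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(samples)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

In practice one might instead split by disk so that samples of the same disk never appear in both sets; that variant is not described in the text and is mentioned only as a design consideration.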
establishing a base model:
establishing a plurality of different machine learning models which are respectively used for carrying out fault prediction according to the characteristic data of the hard disk, wherein each machine learning model is respectively used as a base model;
as a preferred implementation, in this embodiment, the machine learning models established are: CatBoost, XGBoost and LightGBM; when performing failure prediction on hard disks, CatBoost, XGBoost and LightGBM train faster and predict more accurately than other machine learning models, so taking CatBoost, XGBoost and LightGBM as base models can obtain a better prediction effect; it should be noted that the base models established here are only a preferred embodiment of the present invention, and in some other embodiments of the present invention, the type and number of the established base models may differ from those of this embodiment;
establishing a sub-model:
for a single base model, randomly selecting partial features from the optimal feature subset according to a specified proportion, and training the base model by using a training set to obtain a sub-model; in the training process, only the selected features in each sample are taken as input;
the proportion of the selected features is denoted as p; accordingly, the number of features selected from the optimal feature subset can be expressed as
$$\lfloor p \times N \rfloor$$
where N is the total number of features in the optimal feature subset and $\lfloor \cdot \rfloor$ represents rounding down; the proportion p is a hyper-parameter and can be dynamically adjusted to achieve the best effect;
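The random feature selection for one sub-model can be sketched as follows: the number of features is the proportion p times N, rounded down, and the subset is drawn without replacement. The function name and the use of a per-sub-model seed are assumptions for illustration.

```python
import math
import random


def select_features(optimal_subset, p, seed):
    """Randomly choose floor(p * N) features from the optimal feature subset.

    optimal_subset: list of the N feature names in the optimal subset.
    p: proportion of features to keep (a hyper-parameter).
    seed: seed for this sub-model's random choice (assumed convention).
    """
    k = math.floor(p * len(optimal_subset))
    rng = random.Random(seed)
    return rng.sample(optimal_subset, k)
```

For example, with N = 7 and p = 0.8 the sub-model is trained on 5 features, since 5.6 is rounded down.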
model fusion step:
respectively executing the sub-model establishing step for multiple times for each base model, integrating all the sub-models into a hard disk fault prediction model after obtaining a plurality of sub-models, fusing prediction results output by all the sub-models to be used as prediction results of the hard disk fault prediction model, and performing parameter optimization and evaluation on the fault prediction model by using the test set;
for different base models, when the sub-model establishing step is executed, the proportion p can be the same or different;
optionally, in this embodiment, 5 sub-models are generated for each base model, so that 3 × 5 = 15 sub-models are generated in total.
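The bookkeeping for the sub-models can be sketched as a simple plan that pairs each base model with several runs. Using the running index as the random seed is an assumption made here so that every sub-model receives a distinct seed; the function name and the dictionary keys are likewise hypothetical.

```python
import itertools


def plan_submodels(base_models, runs_per_model):
    """List one training configuration per sub-model.

    base_models: names of the base models.
    runs_per_model: how many sub-models to build from each base model.
    """
    plans = []
    pairs = itertools.product(base_models, range(runs_per_model))
    for index, (name, run) in enumerate(pairs):
        # Assumption: the running index doubles as a distinct random seed.
        plans.append({"base_model": name, "run": run, "seed": index})
    return plans
```

Calling `plan_submodels(["CatBoost", "XGBoost", "LightGBM"], 5)` produces 15 configurations, one for each sub-model of this embodiment.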
Because the hard disk data set is extremely imbalanced between positive and negative samples, with good disk samples far outnumbering bad disk samples, in order to reduce the influence of sample skew on the prediction result, as a preferred implementation manner, in the sub-model establishing step of this embodiment, training the base model using the training set includes:
randomly undersampling the good disk samples in the training set, forming a new training set from the sampled good disk samples and all the bad disk samples in the training set, and training the base model using the samples in the new training set;
wherein randomly undersampling the good disk samples in the training set comprises:
randomly sampling the good disks involved in the training set according to a preset sampling proportion;
for each sampled good disk, randomly selecting one sample from the samples of that disk, all the randomly selected good disk samples forming the random undersampling result;
based on this random undersampling process, the good disk samples are distributed as widely as possible, which further ensures the prediction accuracy and robustness of the model.
In order to ensure the difference between the sub-models, as a preferred implementation manner, in this embodiment, the random seeds of the base model are different each time the sub-model establishing step is executed;
based on the above model training process, in this embodiment, the model fusion process is as shown in fig. 2, when the hard disk failure prediction model predicts that a disk is a good disk, the prediction result output correspondingly is 0, and when the disk is predicted to be a bad disk, the prediction result output correspondingly is 1;
in the model fusion step of this embodiment, the test set is used to tune the parameters of the fused hard disk failure prediction model; specifically, hyper-parameters such as the number of trees, the tree depth and the learning rate are tuned. Meanwhile, in order to evaluate the performance of the model, the invention adopts the F-measure as the evaluation index of the model, calculated as:
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where Precision is the proportion of correctly detected failed disks among all disks predicted as failed, and Recall is the proportion of correctly detected failed disks among all failed disks; Precision and Recall are respectively calculated as:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
where TP represents the number of bad disks that are detected and FN represents the number of bad disks that are not detected, that is, TP + FN is the number of all bad disks in the data set; FP represents the number of good disks predicted as bad disks and TN represents the number of good disks predicted as good disks, that is, TP + FP is the number of disks predicted as bad in the data set. The four values make up the confusion matrix shown in Table 1:
TABLE 1
                     Predicted bad disk     Predicted good disk
Actual bad disk      TP                     FN
Actual good disk     FP                     TN
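The evaluation index can be computed directly from the confusion matrix counts. The sketch below assumes the F-measure takes the balanced form, that is, the harmonic mean of Precision and Recall; returning 0.0 when a denominator is zero is also an assumption of this example.

```python
def f_measure(tp, fp, fn):
    """Compute Precision, Recall and the balanced F-measure.

    tp: bad disks predicted as bad
    fp: good disks predicted as bad
    fn: bad disks predicted as good
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```

TN does not appear in the computation, which is why the F-measure remains informative when good disks vastly outnumber bad disks.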
As an optional implementation manner, in the feature engineering step, constructing a new feature through feature engineering includes:
for each basic feature, calculating one or more statistical features of the basic feature, and taking each statistical feature as a new feature;
wherein the statistical features comprise maxima, and/or means;
many indicators of a hard disk, such as the amount of data written and the number of check errors, change over time and gradually increase; in many cases the maximum value and the mean value of the same indicator differ greatly between good disks and bad disks, for example, the maximum number of check errors of a bad disk may be many times that of a good disk, so these statistical features have a large influence on the prediction result of the model; through feature engineering, the method uses statistical information such as the maximum value and the mean value of the basic features as new features, which improves the prediction accuracy of the model;
constructing new features through feature engineering, further comprising:
for each basic feature, dividing the raw value by the normalized value to obtain a new feature;
for each feature of the hard disk, the manufacturer provides a raw value and a normalized value; the normalized value is the result of some normalization operation on the raw value, which means there should be a potential relationship between the two that can be mined, and since the normalized values all lie within the interval [0,1], the raw value can be divided by the normalized value to generate a new feature. In this embodiment, the raw value of a basic feature divided by its normalized value is taken as a new feature, so that the relationships among the features can be further mined and the prediction accuracy of the model improved;
it should be noted that the new features constructed by the feature engineering in the present invention are not limited to the above new features, and in other embodiments of the present invention, the new features may be constructed in other manners based on the relationship between the features.
As an optional implementation, in the feature engineering step, a wrapper method is used to select, from all the features, the partial features that give the highest hard disk failure prediction accuracy. The selection model used when executing the wrapper method is one of the established base models; in this embodiment it is LightGBM. K features are eliminated in each recursion, and in this embodiment K is 1. The corresponding selection process is shown in FIG. 3 and specifically includes the following steps:
a. letting the total number of current features be M, tentatively deleting K of them;
b. scoring the model on the remaining M-K features using LightGBM;
c. if the score improves, taking the current M-K features as the optimal feature subset; if the score decreases or is unchanged, keeping the K features, since they cannot be deleted;
d. repeating steps a to c until all the features have been traversed;
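Steps a to d can be sketched as a greedy backward elimination with K = 1. To keep the sketch free of dependencies, the scoring model is passed in as a plain callable `score_fn`; in the embodiment this role is played by LightGBM, and the toy scoring function and feature names below are hypothetical.

```python
def backward_eliminate(features, score_fn):
    """Greedy backward elimination that drops one feature at a time (K = 1).

    `score_fn(subset)` returns a score where higher is better. Passing the
    scorer as a callable is an assumption made for this sketch.
    """
    selected = list(features)
    best_score = score_fn(selected)
    for candidate in list(features):  # step d: visit every feature once
        if len(selected) == 1:
            break  # always keep at least one feature
        trial = [f for f in selected if f != candidate]  # step a
        trial_score = score_fn(trial)  # step b
        if trial_score > best_score:  # step c: keep the deletion
            selected, best_score = trial, trial_score
        # otherwise the candidate stays in the subset
    return selected, best_score


# Toy scorer: useful features add 1.0, every other feature costs 0.1.
USEFUL = {"smart_5_raw", "smart_187_raw"}


def toy_score(subset):
    return sum(1.0 if f in USEFUL else -0.1 for f in subset)


subset, score = backward_eliminate(
    ["smart_5_raw", "smart_9_raw", "smart_187_raw", "smart_194_raw"],
    toy_score,
)
```

Because each feature is tested exactly once, the procedure needs one model evaluation per feature, which is why a fast selection model matters.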
Compared with other machine learning models, LightGBM trains quickly and is accurate, so repeated scoring takes little time; using LightGBM as the selection model in this embodiment therefore allows the optimal feature subset to be screened out quickly and accurately. In other embodiments of the present invention, other selection models may be used, such as random forest (RandomForest), support vector machine (SupportVectorMachine), XGBoost, or decision tree (DecisionTree);
in this embodiment, the optimal feature subset is screened from all the features by the wrapper method, so that the relationship between the hard disk state information and failures can be found while reducing dimensionality, which further improves the prediction accuracy of the model.
As a preferred implementation, in the feature engineering step of this embodiment, before new features are constructed through feature engineering, the method further includes:
after data cleaning is performed on the historical data, removing, as essentially constant features, those features whose maximum and minimum values differ by no more than a preset threshold;
Data cleaning mainly consists of removing outliers and filling in missing values. A data set inevitably contains some missing values and outliers, and their presence prevents a good model from being built; data cleaning removes the outliers and fills in the missing values so that the modeling effect is guaranteed. When handling missing values, if more than 20% of a column is missing, the feature is deleted directly; otherwise the missing values are filled with the mean. Outliers are removed directly; because they are few, the effect of removing them on the overall hard disk data is negligible;
in this embodiment, the main purpose of feature engineering is to find indicators that readily distinguish good disks from bad disks, which leads to a good modeling effect.
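The missing-value rule and the removal of essentially constant features can be sketched as below. The column-dictionary layout, the function name `clean_columns`, and the default threshold are assumptions for illustration; outlier removal is left out because the text does not say how outliers are detected.

```python
def clean_columns(columns, missing_limit=0.2, range_threshold=0.0):
    """Drop sparse and near-constant columns and mean-fill the rest.

    `columns` maps a feature name to a list of values, with None marking a
    missing value. `range_threshold` stands in for the preset threshold,
    whose actual value the source does not give.
    """
    cleaned = {}
    for name, values in columns.items():
        if not values:
            continue  # nothing to clean in an empty column
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        if missing_ratio > missing_limit:
            continue  # more than 20% missing: delete the feature
        mean = sum(present) / len(present)
        filled = [mean if v is None else v for v in values]
        if max(filled) - min(filled) <= range_threshold:
            continue  # essentially constant: remove the feature
        cleaned[name] = filled
    return cleaned


# Made-up columns: one usable, one too sparse, one constant.
columns = {
    "smart_5_raw": [0, 2, None, 6, 0, 4],
    "smart_3_raw": [None, None, 1, 2, 3, 4],
    "smart_10_raw": [7, 7, 7, 7, 7, 7],
}
cleaned = clean_columns(columns)
```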
In general, this embodiment extracts basic features from SMART information and constructs new features through feature engineering, so that the relationships within and between the original SMART features can be mined and meaningful new features introduced. From all the features, an optimal feature subset is then screened out, so that the relationship between the hard disk state information and failures can be found while reducing dimensionality. A plurality of independent sub-models with strong prediction capability are generated from different machine learning models, and their outputs are then fused, which strengthens the generalization capability of the final model across different data sets. The detection rate and the accuracy of hard disk failure detection are thereby effectively improved.
Example 2:
A hard disk failure prediction method based on model fusion, as shown in FIG. 4, includes:
for real-time hard disk SMART data collected from a data center, constructing the features corresponding to the real-time data according to the optimal feature subset obtained by the model fusion-based hard disk failure prediction model building method provided in Example 1;
inputting the features corresponding to the real-time data into each sub-model of the hard disk failure prediction model obtained by the model fusion-based hard disk failure prediction model building method provided in Example 1;
performing soft voting on the prediction results of the sub-models, and taking the soft voting result as the final hard disk failure prediction result;
The hard disk failure prediction model established by the model building method provided in Example 1 has high prediction accuracy; the hard disk failure prediction method of this embodiment is based on that model and therefore also has high prediction accuracy;
Soft voting sums the prediction probabilities output by the sub-models and averages them, and then takes whichever of the good-disk probability and the bad-disk probability is higher as the final prediction result. Compared with hard voting, which counts only the predicted labels, soft voting lets each sub-model contribute in proportion to its predicted probability, so that different weights are in effect assigned to the sub-models dynamically and a better prediction result can be obtained. In this embodiment, soft voting is therefore used when fusing the prediction results output by the sub-models.
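A minimal sketch of soft voting over the sub-model outputs follows, with a hard vote computed on the same inputs for comparison. The function name, the optional `weights` argument, and the probabilities are assumptions for illustration; the text describes a plain average.

```python
def soft_vote(bad_disk_probabilities, weights=None):
    """Average the sub-models' P(bad disk) and pick the likelier class.

    Returns (label, p_bad) with label 1 = bad disk and 0 = good disk.
    `weights` is an optional extension assumed here; with the default of
    equal weights this is the plain average described in the text.
    """
    if weights is None:
        weights = [1.0] * len(bad_disk_probabilities)
    total = sum(weights)
    p_bad = sum(w * p for w, p in zip(weights, bad_disk_probabilities)) / total
    label = 1 if p_bad > 1 - p_bad else 0
    return label, p_bad


# Made-up outputs of three sub-models for one disk.
probs = [0.9, 0.4, 0.45]
soft_label, p_bad = soft_vote(probs)

# Hard voting on the same outputs: each sub-model casts one label vote.
bad_votes = sum(1 for p in probs if p > 0.5)
hard_label = 1 if bad_votes * 2 > len(probs) else 0
```

Here one confident sub-model outweighs two mildly opposed ones under soft voting, while hard voting follows the two-to-one label majority.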
Example 3:
a computer readable storage medium comprising a stored computer program;
when the computer program is executed by a processor, the device on which the computer-readable storage medium is located is controlled to execute the model fusion-based hard disk failure prediction model building method provided in Example 1 above and/or the model fusion-based hard disk failure prediction method provided in Example 2 above.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A hard disk failure prediction model building method based on model fusion is characterized by comprising the following steps:
a feature engineering step: extracting basic features from historical hard disk SMART data periodically collected from a data center, constructing new features through feature engineering, and selecting, from all the features, the partial features that give the highest hard disk failure prediction accuracy, to obtain an optimal feature subset;
a data set construction step: according to the optimal feature subset, constructing the features corresponding to each piece of data in the historical data, wherein the features of each piece of data and the corresponding hard disk state form a sample, and all the samples form a training data set; dividing the training data set into a training set and a test set;
a base model building step: building a plurality of different machine learning models, each used for failure prediction according to the feature data of a hard disk, wherein each machine learning model serves as a base model;
a sub-model building step: for a single base model, randomly selecting partial features from the optimal feature subset according to a specified proportion, and training the base model by using the training set to obtain a sub-model; in the training process, only the selected features of each sample are taken as input;
a model fusion step: executing the sub-model building step multiple times for each base model to obtain a plurality of sub-models, integrating all the sub-models into a hard disk failure prediction model, fusing the prediction results output by all the sub-models as the prediction result of the hard disk failure prediction model, and performing parameter optimization and evaluation on the hard disk failure prediction model by using the test set.
2. The method for building a hard disk failure prediction model based on model fusion according to claim 1, wherein in the sub-model building step, the training of the base model by using the training set comprises:
randomly undersampling the good disk samples in the training set, forming a new training set from the sampled good disk samples and all the bad disk samples in the training set, and training the base model by using the samples in the new training set;
wherein randomly undersampling the good disk samples in the training set comprises:
randomly sampling the good disks involved in the training set according to a preset sampling proportion;
and for each good disk obtained by sampling, randomly selecting one sample from the samples of that disk, all the randomly selected good disk samples forming the random undersampling result.
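The two-stage undersampling recited above (first sample good disks, then take one sample per chosen disk) can be sketched as follows. The sample keys `serial` and `label`, the function name, and the assumption that all samples of one disk share a label are illustrative only.

```python
import random


def undersample_good_disks(samples, ratio, seed=0):
    """Keep every bad-disk sample plus one random sample per sampled good disk.

    `samples` is a list of dicts with the hypothetical keys `serial` and
    `label` (1 = bad disk, 0 = good disk); `ratio` is the preset sampling
    proportion. All samples of a disk are assumed to carry the same label.
    """
    rng = random.Random(seed)
    bad = [s for s in samples if s["label"] == 1]
    by_disk = {}
    for s in samples:
        if s["label"] == 0:
            by_disk.setdefault(s["serial"], []).append(s)
    serials = sorted(by_disk)
    n_disks = max(1, int(len(serials) * ratio)) if serials else 0
    chosen = rng.sample(serials, n_disks)  # stage 1: sample good disks
    good = [rng.choice(by_disk[serial]) for serial in chosen]  # stage 2
    return good + bad


# Ten good disks with three samples each, plus two bad-disk samples.
samples = [
    {"serial": f"good-{d}", "day": day, "label": 0}
    for d in range(10)
    for day in range(3)
]
samples += [{"serial": "bad-0", "day": day, "label": 1} for day in range(2)]
balanced = undersample_good_disks(samples, ratio=0.5)
```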
3. The model fusion-based hard disk failure prediction model building method of claim 1, wherein the random seed of the base model is different each time the sub-model building step is performed.
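How repeated execution of the sub-model building step yields distinct sub-models can be sketched as a plan of (base model, seed, feature subset) triples. The function name, the seed numbering, and the placeholder feature names are assumptions; actual training with each library is omitted.

```python
import random


def plan_sub_models(base_models, feature_subset, proportion, runs_per_model):
    """Return one plan per sub-model: base model, random seed, features.

    Each run uses a different seed, both to draw the feature subset and to
    be handed to the base model as its random seed. The numbering scheme is
    an assumption made for this sketch.
    """
    n_features = max(1, int(len(feature_subset) * proportion))
    plans = []
    seed = 0
    for model_name in base_models:
        for _ in range(runs_per_model):
            rng = random.Random(seed)
            features = sorted(rng.sample(feature_subset, n_features))
            plans.append(
                {"base_model": model_name, "seed": seed, "features": features}
            )
            seed += 1
    return plans


# Ten placeholder features, 80% drawn per sub-model, four runs per base model.
optimal_features = [f"f{i}" for i in range(10)]
plans = plan_sub_models(
    ["CatBoost", "XGBoost", "LightGBM"],
    optimal_features,
    proportion=0.8,
    runs_per_model=4,
)
```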
4. The method for building a hard disk failure prediction model based on model fusion according to any one of claims 1 to 3, wherein in the base model building step, the machine learning models built are CatBoost, XGBoost and LightGBM.
5. The method for building a hard disk failure prediction model based on model fusion according to claim 1, wherein in the feature engineering step, constructing new features through feature engineering comprises:
for each basic feature, calculating one or more statistical features of the basic feature, and taking each statistical feature as a new feature;
wherein the statistical features comprise a maximum value and/or a mean value.
6. The method for building a hard disk failure prediction model based on model fusion according to claim 1, wherein in the feature engineering step, constructing new features through feature engineering comprises:
for each basic feature, dividing the raw value by the normalized value and taking the quotient as a new feature.
7. The method according to claim 1, wherein in the feature engineering step, a wrapper method is used to select, from all the features, the partial features that give the highest hard disk failure prediction accuracy, and the selection model used when executing the wrapper method is one of the established base models.
8. The method for building a hard disk failure prediction model based on model fusion according to any one of claims 5 to 7, wherein in the feature engineering step, before new features are constructed through feature engineering, the method further comprises:
after data cleaning is performed on the historical data, removing, as essentially constant features, the features whose maximum and minimum values differ by no more than a preset threshold.
9. A hard disk failure prediction method based on model fusion is characterized by comprising the following steps:
for real-time hard disk SMART data collected from a data center, constructing the features corresponding to the real-time data according to the optimal feature subset obtained by the hard disk failure prediction model building method based on model fusion according to any one of claims 1 to 8;
inputting the features corresponding to the real-time data into each sub-model of the hard disk failure prediction model obtained by the hard disk failure prediction model building method based on model fusion according to any one of claims 1 to 8;
and performing soft voting on the prediction results of the sub-models, and taking the soft voting result as the final hard disk failure prediction result.
10. A computer-readable storage medium comprising a stored computer program;
when being executed by a processor, the computer program controls a device on which the computer readable storage medium is located to execute the method for building a model fusion-based hard disk failure prediction model according to any one of claims 1 to 8, and/or the method for predicting a model fusion-based hard disk failure according to claim 9.
CN202011147445.4A 2020-10-23 2020-10-23 Hard disk fault prediction model establishing method based on model fusion and application thereof Pending CN112214369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011147445.4A CN112214369A (en) 2020-10-23 2020-10-23 Hard disk fault prediction model establishing method based on model fusion and application thereof

Publications (1)

Publication Number Publication Date
CN112214369A true CN112214369A (en) 2021-01-12

Family

ID=74055239

Country Status (1)

Country Link
CN (1) CN112214369A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210112