CN112214369A

CN112214369A - Hard disk fault prediction model establishing method based on model fusion and application thereof

Info

Publication number: CN112214369A
Application number: CN202011147445.4A
Authority: CN
Inventors: 陈俭喜; 冯丹; 陈彧; 陈鑫宇; 马莉珍; 郑梦丽; 董深育
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2021-01-12

Abstract

The invention discloses a method for establishing a hard disk failure prediction model based on model fusion and an application thereof, belonging to the technical field of computer storage. The method comprises the following steps: extracting basic features from historical hard disk SMART data, constructing new features, and then screening out an optimal feature subset; constructing the features corresponding to each piece of historical data according to the screening result, combining them with the corresponding hard disk state to form a sample and thereby obtain a training data set, and then dividing the training data set into a training set and a test set; establishing a plurality of different machine learning models as base models; executing a sub-model establishing step multiple times for each base model, integrating the resulting sub-models into a hard disk failure prediction model, and performing parameter tuning and evaluation on the hard disk failure prediction model using the test set. The sub-model establishing step is as follows: randomly selecting partial features from the optimal feature subset, and training a single base model using the training set to obtain a sub-model, with only the selected features used as input during training. The invention can improve the accuracy of hard disk failure prediction.

Description

Hard disk fault prediction model establishing method based on model fusion and application thereof

Technical Field

The invention belongs to the technical field of computer storage, and particularly relates to a hard disk fault prediction model building method based on model fusion and application thereof.

Background

With the development of technologies such as big data and cloud computing in recent years, many internet enterprises such as Microsoft, Google, and Alibaba have established large data centers to provide cloud services for users. However, as the number of users grows and the storage scale increases sharply, various failures occur in the data center. The hard disk is the main device for storing data; compared with other devices, hard disks are large in number and short in service life, and hard disk failures can greatly affect the reliability and the user experience of the data center. Therefore, early prediction of hard disk failures is of great value to enterprises.

Existing methods for improving data center reliability fall into two categories: active fault tolerance and passive fault tolerance. Passive fault tolerance mainly includes technologies such as erasure codes and backup, which are costly, and the cost keeps growing as the amount of data increases, so their use is limited to a certain extent. Compared with passive fault tolerance, active fault tolerance has an obvious cost advantage. One commonly used active fault tolerance method is to periodically monitor the SMART (Self-Monitoring, Analysis and Reporting Technology) attributes of the hard disk; when an attribute reaches a specified threshold, the hard disk may fail. Another common active fault tolerance method uses machine learning: it treats hard disk failure prediction as a binary classification problem, builds a model from historical SMART data, and applies the model to newly acquired data to determine whether the hard disk will fail.

Among active fault tolerance techniques, judging hard disk failure only by monitoring SMART attributes is too simplistic, and its accuracy is low. Traditional machine learning methods generally use a single model and only the SMART attributes of the hard disk themselves, without adequately mining the relationships between the attributes, so they cannot be applied well in actual production environments. In general, the accuracy of existing active fault tolerance techniques for hard disk failure prediction needs to be further improved.

Disclosure of Invention

Aiming at the defects and improvement requirements of the prior art, the invention provides a hard disk failure prediction model establishing method based on model fusion and application thereof, and aims to improve the accuracy of hard disk failure prediction.

In order to achieve the above object, according to an aspect of the present invention, there is provided a method for establishing a hard disk failure prediction model based on model fusion, including:

a feature engineering step: extracting basic features from historical hard disk SMART data periodically collected from a data center, constructing new features through feature engineering, and selecting, from all the features, the partial features that make the hard disk failure prediction accuracy highest, to obtain an optimal feature subset;

a data set construction step: according to the optimal feature subset, constructing the features corresponding to each piece of data in the historical data, wherein the features of each piece of data and the corresponding hard disk state form a sample, and all the samples form a training data set; dividing the training data set into a training set and a test set;

a base model establishing step: establishing a plurality of different machine learning models, each used to perform failure prediction according to the feature data of the hard disk, each machine learning model serving as a base model;

a sub-model establishing step: for a single base model, randomly selecting partial features from the optimal feature subset according to a specified proportion, and training the base model using the training set to obtain a sub-model; in the training process, only the selected features of each sample are taken as input;

a model fusion step: executing the sub-model establishing step multiple times for each base model to obtain a plurality of sub-models, integrating all the sub-models into a hard disk failure prediction model, fusing the prediction results output by all the sub-models as the prediction result of the hard disk failure prediction model, and performing parameter tuning and evaluation on the hard disk failure prediction model using the test set.

The method extracts basic features from the SMART information and, on that basis, constructs new features through feature engineering, so that the relationships within and between the original SMART features can be mined and meaningful new features introduced. An optimal feature subset is then screened out from all the features, so that the relationship between the hard disk state information and failures can be found while the dimensionality is reduced. A plurality of independent sub-models with strong prediction capability are generated based on different machine learning models, and the outputs of the sub-models are then fused, which enhances the generalization capability of the final model on different data sets and effectively improves the hard disk failure detection rate. Therefore, the invention can effectively improve the accuracy of hard disk failure detection.

Further, in the sub-model establishing step, training the base model by using a training set, including:

randomly undersampling the good disk samples in the training set, forming a new training set from the sampled good disk samples and all the bad disk samples in the training set, and training the base model using the samples in the new training set;

wherein randomly undersampling the good disk samples in the training set includes:

randomly sampling the good disks involved in the training set according to a preset sampling proportion;

and for each sampled good disk, randomly selecting one sample from the samples of that disk, all the randomly selected good disk samples forming the random undersampling result.

Because the hard disk data set is unbalanced between positive and negative samples, with far more good disk samples than bad disk samples, the invention undersamples the good disk samples and uses the sampled good disk samples together with all the bad disk samples as the data set for training the base model, thereby effectively reducing the influence of sample skew on the prediction result. In the undersampling process, the good disks are sampled first, and then one sample is randomly selected from each sampled good disk, which keeps the distribution range of the good disk samples as wide as possible and thereby helps ensure the prediction accuracy and robustness of the model.

Further, the random seed of the base model is different each time the sub-model building step is performed, thereby ensuring the difference between all sub-models.

Further, in the base model establishing step, the established machine learning models are respectively: CatBoost, XGBoost and LightGBM.

When failure prediction is performed on hard disks, CatBoost, XGBoost and LightGBM offer higher training speed and higher prediction accuracy than other machine learning models, so the method takes CatBoost, XGBoost and LightGBM as the base models and can obtain a better prediction effect.

Further, in the feature engineering step, constructing a new feature through feature engineering, including:

for each basic feature, calculating one or more statistical features of the basic feature, and taking each statistical feature as a new feature;

wherein the statistical features include a maximum value, and/or a mean value.

Many indexes in the hard disk, such as data write-in quantity, check error number and the like, are changed and gradually increased along with time, the maximum value and the average value of the same index of a good disk and a bad disk may be greatly different in many cases, for example, the maximum value of the check error number of the bad disk may be many times that of the good disk, and therefore, the statistical characteristics have a large influence on the prediction result of the model; according to the method, statistical information such as the maximum value, the mean value and the like of basic features are used as new features through feature engineering, and the prediction accuracy of the model is improved.
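The statistical features described above can be illustrated with a minimal Python sketch. It assumes the statistics are computed per disk over that disk's available daily records; the source does not specify the window, and the function name `statistical_features` and the feature name `check_errors` are hypothetical.

```python
from statistics import mean

def statistical_features(history, feature_names):
    """Compute the maximum and mean of each basic feature over one disk's history.

    history: list of daily records (dicts) for a single disk.
    The per-disk window is an assumption; the source does not fix it.
    """
    stats = {}
    for name in feature_names:
        values = [day[name] for day in history]
        stats[f"{name}_max"] = max(values)
        stats[f"{name}_mean"] = mean(values)
    return stats

# Hypothetical three-day history of one disk.
history = [{"check_errors": 0}, {"check_errors": 2}, {"check_errors": 4}]
new_features = statistical_features(history, ["check_errors"])
```

Each returned entry would be appended to the disk's samples as an additional feature column.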

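Further, in the feature engineering step, constructing a new feature through feature engineering further includes: for each basic feature, dividing the original value by the normalized value to serve as a new feature.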

For each feature of the hard disk, the manufacturer provides a raw value (the original value) and a normalized value. The normalized value is the result of applying some normalization operation to the raw value, which means that a potential relationship between the two remains to be mined; the normalized values all lie within the interval [0,1], so the raw value can be divided by the normalized value to generate a new feature. By using the original value of each basic feature divided by its normalized value as a new feature, the method can further mine the relationships among the features and improve the prediction accuracy of the model.
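A minimal Python sketch of this ratio feature follows. The guard against a zero normalized value is an assumption added for illustration (the source does not say how that case is handled), and the function name `raw_to_normalized_ratio` is hypothetical.

```python
def raw_to_normalized_ratio(raw, normalized, eps=1e-9):
    """Return raw / normalized, or None when the normalized value is (near) zero.

    Returning None for a zero denominator is an assumption; the value could
    then be treated as missing by later data cleaning.
    """
    if abs(normalized) < eps:
        return None
    return raw / normalized

ratio = raw_to_normalized_ratio(50.0, 0.5)
undefined = raw_to_normalized_ratio(1.0, 0.0)
```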

Further, in the feature engineering step, a wrapper method is used to select, from all the features, the partial features that make the hard disk failure prediction accuracy highest, and when the wrapper method is executed, the selection model adopted is one of the established base models.

The invention screens out the optimal feature subset from all the features by using a wrapper method, so that the relationship between the hard disk state information and failures can be found while the dimensionality is reduced, further improving the prediction accuracy of the model.

Further, in the step of feature engineering, before constructing a new feature by feature engineering, the method further includes:

after data cleaning is performed on the historical data, removing, as basically unchanged features, the features whose difference between maximum value and minimum value does not exceed a preset threshold.

Because some missing values and abnormal values inevitably occur in the data set, and their presence prevents the model from being well constructed, data cleaning is used to remove the abnormal values and fill the missing values so as to ensure the modeling effect. In the invention, the main purpose of feature engineering is to find indicators that readily distinguish good disks from bad disks so as to achieve a good modeling effect; if the value of a feature does not change for a long time on either good disks or bad disks, the feature is not useful for modeling, and removing such basically unchanged features effectively reduces the feature dimension and the amount of computation.

According to another aspect of the present invention, a hard disk failure prediction method based on model fusion is provided, including:

for real-time hard disk SMART data collected from the data center, constructing the features corresponding to the real-time data according to the optimal feature subset obtained by the method for establishing a hard disk failure prediction model based on model fusion provided by the invention;

inputting the features corresponding to the real-time data into each sub-model of the hard disk failure prediction model obtained by the method for establishing a hard disk failure prediction model based on model fusion provided by the invention;

and performing soft voting on the prediction results of the sub-models, and taking the soft voting result as the final hard disk failure prediction result.

The hard disk failure prediction model established by the model establishing method provided by the invention has high prediction accuracy; based on this model, the hard disk failure prediction method based on model fusion provided by the invention also has high prediction accuracy, and adopting soft voting when fusing the prediction results output by the sub-models yields a better prediction result.

According to yet another aspect of the present invention, there is provided a computer readable storage medium comprising a stored computer program;

when the computer program is executed by the processor, the device where the computer readable storage medium is located is controlled to execute the hard disk failure prediction model building method based on model fusion provided by the invention and/or the hard disk failure prediction method based on model fusion provided by the invention.

Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:

(1) The method extracts basic features from the SMART information and, on that basis, constructs new features through feature engineering, so that the relationships within and between the original SMART features can be mined and meaningful new features introduced. An optimal feature subset is then screened out from all the features, so that the relationship between the hard disk state information and failures can be found while the dimensionality is reduced. A plurality of independent sub-models with strong prediction capability are generated based on different machine learning models, and the outputs of the sub-models are then fused, which enhances the generalization capability of the final model on different data sets and effectively improves the hard disk failure detection rate. Therefore, the invention can effectively improve the accuracy of hard disk failure detection.

(2) Through feature engineering, the method uses the statistical information of the SMART features and the result of dividing the original value by the normalized value as new features, so that the relationships among the features can be fully mined and the accuracy of the model in predicting hard disk failures is further improved.

(3) Before the new features are constructed through feature engineering, the data are cleaned and the basically unchanged features are filtered out, which further ensures the accuracy of the model in predicting hard disk failures.

(4) The invention fuses the prediction results output by each sub-model in a soft voting mode, and can obtain better prediction effect.

Drawings

FIG. 1 is a flowchart of a method for establishing a hard disk failure prediction model based on model fusion according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a model fusion method according to an embodiment of the present invention;

FIG. 3 is a flowchart of screening an optimal feature subset according to an embodiment of the present invention;

FIG. 4 is a flowchart of a hard disk failure prediction method based on model fusion according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Before explaining the technical scheme of the invention in detail, the SMART information is briefly introduced as follows:

SMART (Self-Monitoring, Analysis and Reporting Technology) is a system and specification for automatically monitoring the state of a hard disk and giving early warning. Through detection instructions in the hard disk, it monitors and records the operating states of hardware such as the heads, platters, motor and circuits, and compares them with the safety values preset by the manufacturer. If a monitored state is about to exceed the preset safe range, the monitoring hardware or software of the host can automatically warn the user and perform minor automatic repairs, so as to protect the hard disk data in advance. SMART attributes have become the main indicators for monitoring hard disk state in recent years, and in the following embodiments hard disk failure prediction is likewise performed based on SMART data. The following are examples.

Example 1:

A method for establishing a hard disk failure prediction model based on model fusion, as shown in FIG. 1, includes the following steps:

feature engineering step:

extracting basic features from historical hard disk SMART data periodically collected from a data center, constructing new features through feature engineering, and selecting, from all the features, the partial features that make the hard disk failure prediction accuracy highest, to obtain an optimal feature subset;

the time interval for collecting hard disk SMART information from the data center can be determined according to the actual situation of the data center, and in this embodiment the hard disk SMART information is collected from the data center every day; in this embodiment, the attributes extracted from the SMART information as basic features mainly include: the raw read error rate, the disk start-up time, the reallocated sector count, the reported uncorrectable errors, the uncorrectable sector count, and the like;

a data set construction step:

according to the optimal feature subset, constructing a feature corresponding to each piece of data in the historical data, wherein the feature of each piece of data and the corresponding hard disk state form a sample, and all the samples form a training data set; dividing a training data set into a training set and a test set;

in the embodiment, one piece of data in the historical data is SMART data of one hard disk for one day; optionally, in this embodiment, the training data set is divided into a training set and a test set according to a ratio of 7: 3;

base model establishing step:

establishing a plurality of different machine learning models, each used to perform failure prediction according to the feature data of the hard disk, each machine learning model serving as a base model;

as a preferred implementation, in this embodiment, the machine learning models established are: CatBoost, XGBoost and LightGBM; when failure prediction is performed on hard disks, CatBoost, XGBoost and LightGBM offer higher training speed and higher prediction accuracy than other machine learning models, so taking CatBoost, XGBoost and LightGBM as the base models yields a better prediction effect; it should be noted that the base models established here are only a preferred embodiment of the present invention, and in some other embodiments of the present invention, the type and number of the established base models may differ from those of the present embodiment;

sub-model establishing step:

for a single base model, randomly selecting partial features from the optimal feature subset according to a specified proportion, and training the base model by using a training set to obtain a sub-model; in the training process, only the selected features in each sample are taken as input;

the proportion of the selected features is denoted as p, and accordingly, the number of features selected from the optimal feature subset can be expressed as

$$\lfloor p \times N \rfloor$$

where N is the total number of features in the optimal feature subset and $\lfloor \cdot \rfloor$ represents rounding down; the proportion p is a hyperparameter and can be dynamically adjusted to achieve the best effect;

model fusion step:

respectively executing the sub-model establishing step for multiple times for each base model, integrating all the sub-models into a hard disk fault prediction model after obtaining a plurality of sub-models, fusing prediction results output by all the sub-models to be used as prediction results of the hard disk fault prediction model, and performing parameter optimization and evaluation on the fault prediction model by using the test set;

for different base models, when the sub-model establishing step is executed, the proportion p can be the same or different;

optionally, in this embodiment, 5 sub-models are generated for each base model, so that 3 × 5 = 15 sub-models are generated in total.
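The random feature selection of the sub-model establishing step, in which the proportion p of the N features in the optimal feature subset is rounded down, can be illustrated with a minimal Python sketch. The function name `select_feature_subset`, the placeholder feature names, and the seeded `random.Random` generator are assumptions made for illustration and are not taken from the embodiment.

```python
import math
import random

def select_feature_subset(optimal_features, p, seed):
    """Randomly pick floor(p * N) distinct features from the optimal feature subset."""
    n_selected = math.floor(p * len(optimal_features))
    rng = random.Random(seed)
    return rng.sample(optimal_features, n_selected)

# Placeholder feature names; the real optimal subset comes from feature selection.
optimal_features = ["f1", "f2", "f3", "f4", "f5", "f6"]
subset = select_feature_subset(optimal_features, p=0.5, seed=1)
print(len(subset))  # floor(0.5 * 6) = 3
```

During training, only the columns named in `subset` would be passed to the base model.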

Because the hard disk data set is extremely unbalanced between positive and negative samples, with far more good disk samples than bad disk samples, and in order to reduce the influence of sample skew on the prediction result, as a preferred implementation manner, in the sub-model establishing step of this embodiment, training the base model using the training set includes:

randomly undersampling the good disk samples in the training set, forming a new training set from the sampled good disk samples and all the bad disk samples in the training set, and training the base model using the samples in the new training set;

wherein randomly undersampling the good disk samples in the training set includes:

randomly sampling the good disks involved in the training set according to a preset sampling proportion;

for each sampled good disk, randomly selecting one sample from the samples of that disk, all the randomly selected good disk samples forming the random undersampling result;

based on this random undersampling process, the distribution range of the good disk samples can be kept as wide as possible, which helps ensure the prediction accuracy and robustness of the model.
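The two-stage undersampling (sample good disks first, then take one sample per sampled disk, and keep every bad disk sample) can be sketched in Python as follows. The tuple layout `(disk_id, features, label)`, the function name `undersample_good_disks`, and the assumption that all samples of a disk share one label are illustrative choices, not details given in the source.

```python
import random
from collections import defaultdict

def undersample_good_disks(train_set, disk_ratio, seed=0):
    """train_set: list of (disk_id, features, label); label 0 = good, 1 = bad."""
    rng = random.Random(seed)
    bad_samples = [s for s in train_set if s[2] == 1]
    # Group good samples by the disk they belong to.
    good_by_disk = defaultdict(list)
    for sample in train_set:
        if sample[2] == 0:
            good_by_disk[sample[0]].append(sample)
    disk_ids = sorted(good_by_disk)
    # Stage 1: sample good disks according to the preset proportion.
    chosen = rng.sample(disk_ids, int(len(disk_ids) * disk_ratio))
    # Stage 2: keep exactly one randomly chosen sample per sampled good disk.
    good_samples = [rng.choice(good_by_disk[d]) for d in chosen]
    return good_samples + bad_samples

# Hypothetical data: 10 good disks with 5 daily samples, 2 bad disks with 3.
train_set = (
    [(f"good-{d}", [float(day)], 0) for d in range(10) for day in range(5)]
    + [(f"bad-{d}", [float(day)], 1) for d in range(2) for day in range(3)]
)
new_train_set = undersample_good_disks(train_set, disk_ratio=0.5)
```

With these inputs the new training set holds 5 good disk samples, each from a different disk, plus all 6 bad disk samples.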

In order to ensure the difference between the sub-models, as a preferred implementation manner, in this embodiment, the random seeds of the base model are different each time the sub-model establishing step is executed;
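One way to guarantee that every sub-model receives a different random seed is to draw the seeds without replacement, as in the Python sketch below. The plan structure, the function name `plan_sub_models`, and the seed range are assumptions for illustration; each plan entry would then be passed to the training routine of the corresponding base model.

```python
import random

BASE_MODELS = ["CatBoost", "XGBoost", "LightGBM"]  # base models named in this embodiment
SUB_MODELS_PER_BASE = 5

def plan_sub_models(master_seed=0):
    """Assign a distinct random seed to each of the 3 x 5 sub-models."""
    rng = random.Random(master_seed)
    n_total = len(BASE_MODELS) * SUB_MODELS_PER_BASE
    # Sampling without replacement guarantees that no two seeds coincide.
    seeds = rng.sample(range(1_000_000), n_total)
    plans = []
    for i, base in enumerate(BASE_MODELS):
        for j in range(SUB_MODELS_PER_BASE):
            plans.append({"base_model": base, "seed": seeds[i * SUB_MODELS_PER_BASE + j]})
    return plans

plans = plan_sub_models()
```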

based on the above model training process, in this embodiment, the model fusion process is as shown in fig. 2, when the hard disk failure prediction model predicts that a disk is a good disk, the prediction result output correspondingly is 0, and when the disk is predicted to be a bad disk, the prediction result output correspondingly is 1;

in the model fusion step of this embodiment, the test set is used to perform parameter tuning on the hard disk failure prediction model obtained by fusion; specifically, hyperparameters such as the number of trees, the tree depth, and the learning rate are tuned. Meanwhile, in order to evaluate the performance of the model, the invention adopts the F-measure as the evaluation index of the model, and the calculation formula is as follows:

$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

where Precision indicates the proportion of correctly detected failed disks among all disks predicted to be failed; Recall indicates the proportion of correctly detected failed disks among all failed disks; Precision and Recall are respectively calculated as:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

where TP represents the number of bad disks that are detected, and FN represents the number of bad disks that are not detected, that is, TP + FN is the number of all bad disks in the data set; FP represents the number of good disks predicted as bad disks, and TN represents the number of good disks predicted as good disks, that is, TP + FP is the number of disks predicted as bad in the data set. The four values make up the confusion matrix shown in Table 1:

TABLE 1

                    Predicted bad disk    Predicted good disk
Actual bad disk     TP                    FN
Actual good disk    FP                    TN
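The evaluation index can be computed from the confusion-matrix counts as in the following Python sketch. It assumes the F-measure is the balanced harmonic mean of Precision and Recall (commonly called F1); the function name `f_measure` and the zero-division handling are illustrative assumptions.

```python
def f_measure(tp, fp, fn):
    """Return (precision, recall, f) from confusion-matrix counts.

    Assumes the balanced F-measure: 2 * P * R / (P + R).
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 bad disks detected, 2 false alarms, 2 bad disks missed.
precision, recall, f = f_measure(tp=8, fp=2, fn=2)
```

TN does not enter the calculation, which is why the F-measure is less distorted than plain accuracy when good disks greatly outnumber bad disks.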

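As an optional implementation manner, in the feature engineering step, constructing a new feature through feature engineering includes:

for each basic feature, calculating one or more statistical features of the basic feature, and taking each statistical feature as a new feature;

wherein the statistical features include a maximum value and/or a mean value;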

many indexes in the hard disk, such as data write-in quantity, check error number and the like, are changed and gradually increased along with time, the maximum value and the average value of the same index of a good disk and a bad disk may be greatly different in many cases, for example, the maximum value of the check error number of the bad disk may be many times that of the good disk, and therefore, the statistical characteristics have a large influence on the prediction result of the model; according to the method, through feature engineering, statistical information such as the maximum value, the mean value and the like of basic features are used as new features, so that the prediction accuracy of the model is improved;

constructing new features through feature engineering, further comprising:

for each basic feature, dividing the original value by the normalized value to serve as a new feature;

for each feature of the hard disk, the manufacturer provides a raw value (the original value) and a normalized value. The normalized value is the result of applying some normalization operation to the raw value, which means that a potential relationship between the two remains to be mined; the normalized values all lie within the interval [0,1], so the raw value can be divided by the normalized value to generate a new feature. In this embodiment, the original value of each basic feature divided by its normalized value serves as a new feature, so that the relationships among the features can be further mined and the prediction accuracy of the model improved;

it should be noted that the new features constructed by the feature engineering in the present invention are not limited to the above new features, and in other embodiments of the present invention, the new features may be constructed in other manners based on the relationship between the features.

As an optional implementation manner, in the feature engineering step, a wrapper method is used to select, from all the features, the partial features that make the hard disk failure prediction accuracy highest, and when the wrapper method is executed, the selection model adopted is one of the established base models. In this embodiment, the selection model is specifically LightGBM, K features are eliminated in each recursion, and the value of K is 1. Correspondingly, the process of selecting, from all the features, the partial features that make the hard disk failure prediction accuracy highest by using the wrapper method is shown in FIG. 3 and specifically includes the following steps:

a. assuming that the total number of features is M, deleting K features from all the features;

b. using LightGBM to evaluate the model score with the remaining M-K features;

c. if the score improves, taking the current M-K features as the optimal feature subset; if the score decreases or remains unchanged, the K features cannot be deleted and are restored;

d. repeating steps a to c until all the features have been traversed;

compared with other machine learning models, the LightGBM model has the characteristics of high training speed and high accuracy, and repeated evaluation takes little time, so taking LightGBM as the selection model in this embodiment allows the optimal feature subset to be screened out quickly and accurately; in some other embodiments of the present invention, other selection models may be used, such as random forest, support vector machine, XGBoost, decision tree, and the like;

in this embodiment, the optimal feature subset is screened out from all the features by using a wrapper method, so that the relationship between the hard disk state information and failures can be found while the dimensionality is reduced, further improving the prediction accuracy of the model.
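Steps a to d can be sketched in Python with K = 1 as follows. The scoring function is passed in as a parameter; in this embodiment it would train and evaluate a LightGBM model on the given features, whereas the `toy_score` used here is a hypothetical stand-in so that the sketch runs without any model. The function name `wrapper_select` is likewise an assumption.

```python
def wrapper_select(features, score_fn):
    """Try deleting each feature once (K = 1); keep a deletion only if the score improves."""
    selected = list(features)
    best_score = score_fn(selected)
    for feature in list(features):
        if len(selected) == 1:
            break  # never delete the last remaining feature
        candidate = [f for f in selected if f != feature]
        score = score_fn(candidate)
        if score > best_score:
            # Score improved: the reduced set becomes the current optimal subset.
            selected, best_score = candidate, score
        # Otherwise the feature is kept (the deletion is undone).
    return selected, best_score

# Hypothetical score: rewards the useful features "a" and "b", penalizes subset size.
USEFUL = {"a", "b"}

def toy_score(features):
    return sum(f in USEFUL for f in features) - 0.1 * len(features)

selected, best = wrapper_select(["a", "b", "c", "d"], toy_score)
```

With this toy score, deleting "a" or "b" lowers the score and is undone, while deleting "c" and "d" raises it, so the loop ends with the subset ["a", "b"].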

As a preferred implementation manner, in the feature engineering step of this embodiment, before constructing a new feature through feature engineering, the method further includes:

after data cleaning is carried out on historical data, the characteristic that the difference between the maximum value and the minimum value does not exceed a preset threshold value is taken as a basically unchangeable characteristic to be removed;

the data cleaning mainly comprises removing abnormal values and filling missing values. Because some missing values and abnormal values inevitably occur in the data set, and their presence prevents the model from being well constructed, data cleaning removes the abnormal values and fills the missing values so as to guarantee the modeling effect; when processing missing values, if more than 20% of the values in a column are missing, the feature is deleted directly; otherwise mean filling is used; abnormal values are removed directly, and because they are few in number, the influence of removing them on the overall hard disk data is negligible;

in this embodiment, the main purpose of the feature engineering is to find indicators that readily distinguish good disks from bad disks so as to achieve a good modeling effect; if the value of a feature does not change for a long time on either good disks or bad disks, the feature is not useful for modeling, and removing such basically unchanged features effectively reduces the feature dimension and the amount of computation.
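The missing-value handling and the removal of basically unchanged features can be sketched in Python as follows. The column-oriented layout, the use of `None` for missing values, and the parameter names are assumptions; abnormal-value removal is omitted because the source does not state the criterion used to identify abnormal values.

```python
from statistics import mean

def clean_columns(columns, max_missing=0.2, min_range=0.0):
    """columns: dict mapping feature name -> list of values (None = missing)."""
    cleaned = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        if not present or missing_ratio > max_missing:
            continue  # more than 20% missing: delete the feature
        fill = mean(present)
        filled = [fill if v is None else v for v in values]
        if max(filled) - min(filled) <= min_range:
            continue  # max - min within the threshold: basically unchanged
        cleaned[name] = filled
    return cleaned

# Hypothetical columns: "a" has 1 of 6 values missing, "b" has 2 of 6, "c" is constant.
columns = {
    "a": [1, None, 3, 5, 7, 9],
    "b": [None, None, 1, 2, 3, 4],
    "c": [4, 4, 4, 4, 4, 4],
}
cleaned = clean_columns(columns)
```

Here only column "a" survives, with its missing value filled by the column mean.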

In general, the embodiment extracts basic features from SMART information, and on the basis, new features are constructed through feature engineering, so that the relationships in and among the original SMART features can be mined, and meaningful new features are introduced; on the basis of all the characteristics, an optimal characteristic subset is further screened out, and the relation between the hard disk state information and the fault can be found on the basis of dimension reduction; a plurality of independent submodels with strong prediction capability are generated based on different machine learning models, then the outputs of the submodels are fused, the generalization capability of the final model to different data sets is enhanced, and the hard disk failure detection rate is effectively improved. Therefore, the accuracy of hard disk fault detection can be effectively improved.

Example 2:

A hard disk failure prediction method based on model fusion, as shown in FIG. 4, includes:

for real-time data of the SMART information of a hard disk collected from the data center, constructing the features corresponding to the real-time data according to the optimal feature subset obtained by the model-fusion-based hard disk failure prediction model establishing method provided in embodiment 1;

inputting the features corresponding to the real-time data into each sub-model of the hard disk failure prediction model obtained by the model-fusion-based hard disk failure prediction model establishing method provided in embodiment 1;

performing soft voting on the prediction result of each sub-model, and taking the soft voting result as a final hard disk failure prediction result;

the hard disk failure prediction model established by the model-fusion-based hard disk failure prediction model establishing method provided in embodiment 1 has high prediction accuracy, and the hard disk failure prediction method provided in this embodiment, being based on that model, likewise has high prediction accuracy;

soft voting means that the prediction probabilities output by the sub-models are summed and averaged, and whichever of the averaged good-disk probability and bad-disk probability is higher is taken as the final prediction result; compared with hard voting, soft voting has the advantage that different weights can be dynamically assigned to the sub-models, so that a better prediction result is obtained; for this reason, soft voting is adopted in this embodiment when the prediction results output by the sub-models are fused.
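The soft voting described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `soft_vote`, the optional `weights` argument, and the rule that a tie is resolved as a good disk are assumptions, since the text does not state how weights are chosen or how ties are handled.

```python
def soft_vote(probabilities, weights=None):
    """probabilities: one (p_good, p_bad) pair per sub-model.
    weights: optional per-sub-model weights (hypothetical); equal if omitted."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    # Weighted average of the per-class probabilities across sub-models.
    p_good = sum(w * p[0] for w, p in zip(weights, probabilities)) / total
    p_bad = sum(w * p[1] for w, p in zip(weights, probabilities)) / total
    # The class with the higher averaged probability is the final prediction.
    label = "bad" if p_bad > p_good else "good"
    return label, p_good, p_bad


# Two of three sub-models lean towards "bad", but only weakly.
probs = [(0.90, 0.10), (0.40, 0.60), (0.45, 0.55)]
label, p_good, p_bad = soft_vote(probs)
weighted_label, _, _ = soft_vote(probs, weights=[1.0, 3.0, 3.0])
```

With equal weights the averaged good-disk probability is about 0.58, so the result is a good disk even though a majority (hard) vote would report a bad disk; giving the second and third sub-models a weight of 3 moves the result to a bad disk.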

Example 3:

a computer readable storage medium comprising a stored computer program;

when the computer program is executed by a processor, the device on which the computer readable storage medium is located is controlled to execute the method for building the model fusion-based hard disk failure prediction model provided in embodiment 1 above and/or the method for predicting the hard disk failure based on the model fusion provided in embodiment 2 above.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A hard disk failure prediction model building method based on model fusion is characterized by comprising the following steps:

a data set construction step: according to the optimal feature subset, constructing a feature corresponding to each piece of data in the historical data, wherein the feature of each piece of data and the corresponding hard disk state form a sample, and all the samples form a training data set; dividing the training data set into a training set and a test set;

a sub-model establishing step: for a single base model, randomly selecting partial features from the optimal feature subset according to a specified proportion, and training the base model by using the training set to obtain a sub-model; in the training process, only the selected features in each sample are taken as input;

a model fusion step: executing the sub-model establishing step multiple times for each base model; after a plurality of sub-models are obtained, integrating all the sub-models into a hard disk fault prediction model, fusing the prediction results output by all the sub-models as the prediction result of the hard disk fault prediction model, and performing parameter optimization and evaluation on the hard disk fault prediction model by using the test set.

2. The method for building a hard disk failure prediction model based on model fusion according to claim 1, wherein in the sub-model building step, the training of the base model by using the training set comprises:

3. The model fusion-based hard disk failure prediction model building method of claim 1, wherein the random seed of the base model is different each time the sub-model building step is performed.

4. The method for building a hard disk failure prediction model based on model fusion according to any one of claims 1 to 3, wherein in the step of building the base model, the built machine learning models are respectively: CatBoost, XGBoost and LightGBM.

5. The method for building a hard disk failure prediction model based on model fusion according to claim 1, wherein in the feature engineering step, new features are built through feature engineering, and the method comprises the following steps:

wherein the statistical features comprise maxima, and/or means.

6. The method for building a hard disk failure prediction model based on model fusion according to claim 1, wherein in the feature engineering step, new features are built through feature engineering, and the method comprises the following steps:

7. The method according to claim 1, wherein in the feature engineering step, the partial features that give the highest hard disk fault prediction accuracy are selected from all the features by using a wrapper method, and the model used when the wrapper method is executed is one of the established base models.

8. The method for building a hard disk failure prediction model based on model fusion according to any one of claims 5 to 7, wherein in the feature engineering step, before building a new feature by feature engineering, the method further comprises:

after data cleaning is carried out on the historical data, removing, as an essentially constant feature, any feature whose maximum and minimum values differ by no more than a preset threshold.

9. A hard disk failure prediction method based on model fusion is characterized by comprising the following steps:

for real-time data of SMART information of a hard disk collected from a data center, constructing features corresponding to the real-time data according to the optimal feature subset obtained by the method for building a hard disk failure prediction model based on model fusion according to any one of claims 1 to 8;

respectively inputting the features corresponding to the real-time data into each sub-model in the hard disk failure prediction model obtained by the method for building a hard disk failure prediction model based on model fusion according to any one of claims 1 to 8;

10. A computer-readable storage medium comprising a stored computer program;

when being executed by a processor, the computer program controls a device on which the computer readable storage medium is located to execute the method for building a model fusion-based hard disk failure prediction model according to any one of claims 1 to 8, and/or the method for predicting a model fusion-based hard disk failure according to claim 9.