CN104503874A

CN104503874A - Hard disk failure prediction method for cloud computing platform

Info

Publication number: CN104503874A
Application number: CN201410837805.1A
Authority: CN
Inventors: 周嵩; 王景峰; 柏文阳; 宋云华
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2014-12-29
Filing date: 2014-12-29
Publication date: 2015-04-08

Abstract

The invention discloses a hard disk failure prediction method for a cloud computing platform. The method comprises the following steps: according to the hard disk maintenance records within a prediction time window, SMART log data of hard disks are labeled as normal hard disk samples or faulty hard disk samples; the denoised normal hard disk samples are then divided into k disjoint subsets using the K-means clustering algorithm; each of the k subsets is combined with the faulty hard disk samples; and k balanced training sets are generated using SMOTE (Synthetic Minority Over-sampling Technique), on which k support vector machine classifiers are trained for predicting faulty hard disks. In the prediction stage, the test set is clustered using DBSCAN (Density-Based Spatial Clustering of Applications with Noise); samples that fall inside a cluster are predicted as normal hard disk samples, while noise samples are predicted by each of the trained classifiers and the final prediction is obtained by voting. The method performs hard disk failure prediction using the SMART data of hard disks and achieves a relatively high failure recall and good overall performance.

Description

A hard disk failure prediction method for a cloud computing platform

Technical field

The present invention relates to a hard disk failure prediction method for a cloud computing platform. It belongs to the field of computer data mining and is, specifically, a hard disk failure prediction algorithm.

Background art

Hard disk failure prediction can safeguard data security, improve operations and maintenance efficiency, and control storage costs. It draws on techniques from several fields, including cloud computing, data mining, hard disk SMART technology, failure prediction, and classification of extremely imbalanced data. Hard disk failure prediction mainly refers to predicting failures from hard disk SMART data. However, as described in document 1 (Pinheiro E, Weber W D, Barroso L A. Failure trends in a large disk drive population [EB/OL]. [2012-10-10]. http://research.google.com/archive/disk_failures.pdf), with statistical methods the cause of 36% of failures cannot be estimated.

At present, hard disk vendors generally predict failures with a threshold method: SMART technology collects the values of various indicators while the disk is running and compares them with preset failure warning thresholds, and a failure warning is triggered when a threshold is exceeded. To reduce the number of disks returned to the factory for inspection and repair because of failure warnings, and thereby reduce costs, vendors usually set the thresholds so that the false alarm rate is as low as possible, which sacrifices prediction accuracy. The threshold method achieves a prediction accuracy of about 3%-10% with a false alarm rate of about 0.1%.
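The threshold method described above can be sketched in a few lines. This is only an illustration of the idea: the attribute names and limits below are hypothetical, and real vendor firmware may compare values in a different direction or on a different scale.

```python
from typing import Dict


def threshold_alarm(smart: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Return True if any monitored attribute exceeds its warning threshold."""
    return any(smart.get(name, 0.0) > limit for name, limit in thresholds.items())


# Hypothetical attribute names and warning limits, for illustration only.
limits = {"reallocated_sectors": 50, "scan_errors": 10}
healthy = {"reallocated_sectors": 3, "scan_errors": 0}
suspect = {"reallocated_sectors": 72, "scan_errors": 0}

print(threshold_alarm(healthy, limits))
print(threshold_alarm(suspect, limits))
```

Raising the limits lowers the false alarm rate but lets more failing disks pass unnoticed, which is the trade-off the paragraph describes.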

Pinheiro et al. found that only four SMART attributes are correlated to some degree with hard disk failure: scan errors, reallocation counts, offline reallocation counts, and probational counts. However, in their statistical study of more than 100,000 Google hard disks, they found that more than 56% of the failed disks had no count in any of these four attributes. They therefore concluded that SMART data alone cannot be used to build an accurate hard disk failure prediction model and is better suited to predicting trends across a disk population; see document 2: E. Pinheiro, W. D. Weber, and L. A. Barroso, "Failure trends in a large disk drive population," in Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST 07), 2007.

Agarwal, Niranjan et al. built a hard disk failure prediction model using only SMART information and the MLRules algorithm, obtaining a detection rate of 66% and a false alarm rate of 3%; see document 3: Vipul Agarwal, Chiranjib Bhattacharyya, Thirumale Niranjan, et al. Discovering Rules from Disk Events for Predicting Hard Drive Failures [C]. IEEE Computer Society, 2009. The failure prediction models built by Hamerly and Elkan, Hughes and Murray et al., and Zhang Chao, mainly from SMART information together with other environmental information, achieved a detection rate of at most 56%; see document 4: Greg Hamerly, Charles Elkan. Bayesian Approaches to Failure Prediction for Disk Drives [C]. Morgan Kaufmann, 2001; document 5: Gordon F. Hughes, Joseph F. Murray, Kenneth Kreutz-Delgado, et al. Improved Disk-Drive Failure Warnings [J]. IEEE Transactions on Reliability, 2002; document 6: Joseph F. Murray, Gordon F. Hughes, Kenneth Kreutz-Delgado. Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application [J]. Journal of Machine Learning Research, 2005, 6: 783-816; document 7: Zhang Chao. Research on self-repair technology for high-performance disk arrays [D]. National University of Defense Technology, 2008. Unlike the study of Pinheiro et al., the experimental data in these studies all come from subsets of disks returned to the manufacturer for repair, so their failure rates are clearly higher than those observed in actual use.

Hard disk failure prediction on a cloud computing platform is a rare-class prediction problem on an extremely imbalanced two-class data set, in which faulty disks are the rare class and non-faulty disks are the majority class. Current strategies for imbalanced classification work mainly at the data level and at the algorithm level. Data-level strategies reduce the degree of imbalance by resampling the data; the main methods are undersampling, oversampling, and combinations of the two. Random undersampling may discard important sample information; oversampling may cause overfitting and also increases training time. Algorithm-level strategies fall roughly into three categories: cost-sensitive learning, support vector machines, and ensemble methods. Cost-sensitive learning adjusts penalty parameters to the situation: in imbalanced classification, setting a larger penalty for misclassifying the positive class improves the classifier's performance on that class, but the effect depends on the parameters chosen. Support vector machines are less sensitive to data imbalance than other classification methods. For example, in document 8 (Japkowicz N, Stephen S. The class imbalance problem: A systematic study [J]. Intelligent Data Analysis, 2002, 6(5): 429-449), Japkowicz et al. compared experimentally the effect of data imbalance on different classification methods, including the C4.5 decision tree, the BP neural network, and the support vector machine; the results show that the support vector machine is relatively insensitive to imbalance, and many methods based on support vector machines have therefore been proposed for this problem. Ensemble methods combine several classifiers to improve classification performance, but they must trade off diversity among the classifiers against bias, and are prone to overfitting.

Summary of the invention

Object of the invention: the technical problem to be solved by the present invention is to address the deficiencies of the prior art by providing a hard disk failure prediction method with high recall and good overall performance.

Technical solution: the invention discloses a hard disk failure prediction method comprising the following steps:

Step one: according to the hard disk maintenance records, label the SMART (Self-Monitoring, Analysis and Reporting Technology) log data of hard disks that fail within the failure prediction time window as faulty hard disk samples, and label the SMART log data of hard disks that do not fail as normal hard disk samples.

Here, given the SMART observations of a hard disk at a certain moment, a model is used to predict whether the disk will fail within a certain period starting from that moment; this period is the hard disk failure prediction time window.

Step two: cluster the normal hard disk samples with the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, remove the noise samples that fall outside the clusters, and retain the normal hard disk samples that are grouped into clusters.

Step three: cluster the denoised normal hard disk samples with the K-means algorithm, dividing them into k disjoint subsets, and merge each subset with the faulty hard disk samples to form k original training sets, where k is the number of K-means clusters and is a natural number smaller than the number of samples.

Step four: oversample the faulty hard disk samples in each original training set with the SMOTE (Synthetic Minority Over-sampling Technique) algorithm so that the numbers of faulty and normal hard disk samples in the training set are equal, thereby obtaining k balanced training sets.

Step five: using the LIBSVM tool with a radial basis function kernel, train a support vector machine model on each of the k balanced training sets, obtaining the k support vector machine sub-classifiers of the ensemble classifier.

Step six: cluster the test sample set with the DBSCAN algorithm, remove the samples that are grouped into clusters and predict them as normal hard disk samples, and retain the noise samples that fall outside the clusters.

Step seven: predict each remaining noise sample with the k support vector machine classifiers obtained in the training stage and determine the classification result by voting: if the number of votes judging a test sample to be a faulty hard disk sample exceeds the set threshold, the sample is predicted as faulty; otherwise it is predicted as normal.

In step two of the present invention, clustering the normal hard disk samples with the DBSCAN algorithm comprises the following steps:

Step (21): select any unvisited sample p in the normal hard disk sample set and count the samples in its neighborhood of radius Eps. If the count is greater than or equal to the preset minimum number of samples Minpts, create a new cluster C and add p and all samples in its Eps-neighborhood to C; if it is less than Minpts, mark p as a noise sample.

Step (22): select any unvisited sample q in C and examine its neighborhood of radius Eps. If the number of samples in the neighborhood is greater than or equal to Minpts, add q and the samples in its neighborhood to C.

Step (23): repeat step (22) until all samples in C have been visited.

Step (24): repeat steps (21) to (23) until all samples in the normal hard disk sample set have been visited and each has either been added to a cluster or marked as noise.

Here Eps is the neighborhood radius, a positive real number, and Minpts is the minimum number of samples a neighborhood must contain, a natural number smaller than the number of samples. The values of Eps and Minpts determine the clustering performance of DBSCAN and can usually only be chosen empirically. DBSCAN defines a cluster as a maximal set of density-connected points. It does not require the number of clusters to be set in advance, can partition high-density regions into clusters of arbitrary shape, and separates out noise regions.
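Steps (21) to (24) can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes Euclidean distance, counts a sample as part of its own neighborhood, and lets a sample first marked as noise be absorbed later as a border sample of a cluster, as in the standard formulation of DBSCAN. The names `dbscan`, `region_query` and `NOISE` are chosen here for illustration.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label given to samples that belong to no cluster


def region_query(data: Sequence[Sequence[float]], i: int, eps: float) -> List[int]:
    """Indices of all samples within distance eps of sample i (including i)."""
    return [j for j, x in enumerate(data) if math.dist(data[i], x) <= eps]


def dbscan(data: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    labels: List[Optional[int]] = [None] * len(data)  # None means unvisited
    cluster = -1
    for p in range(len(data)):
        if labels[p] is not None:
            continue
        neighbors = region_query(data, p, eps)
        if len(neighbors) < min_pts:          # step (21): too sparse, mark as noise
            labels[p] = NOISE
            continue
        cluster += 1                          # step (21): start a new cluster C
        labels[p] = cluster
        queue = [j for j in neighbors if j != p]
        while queue:                          # steps (22)-(23): grow C
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster           # border sample reached from a core sample
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = region_query(data, q, eps)
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
labels = dbscan(points, eps=1.0, min_pts=3)
kept = [x for x, lab in zip(points, labels) if lab != NOISE]  # denoised samples
print(labels, len(kept))
```

In the training stage the samples in `kept` would be retained; in the prediction stage the roles are reversed and the noise samples are the ones passed on to the classifiers.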

In step three of the present invention, clustering the denoised normal hard disk samples with the K-means algorithm comprises the following steps:

Step (31): select any k samples in the normal hard disk sample set as the initial cluster centers, where k is the preset number of clusters.

Step (32): compute the distance from every sample in the normal hard disk sample set to each of the k cluster centers and assign each sample to the nearest cluster.

Step (33): recompute the centers of the k clusters; each cluster center is the mean of all samples in that cluster.

Step (34): repeat steps (32) and (33) until the convergence condition is met. The convergence condition may be that the cluster centers no longer change between two successive iterations, or that their change is smaller than a preset threshold; this threshold is a very small real number relative to the distances between samples, typically between 10^-5 and 10^-1. Alternatively, the condition may be that the number of iterations reaches a preset maximum, which is usually a moderate integer, typically 10 to 100: too small a value leaves the cluster centers far from their theoretical values, while too large a value increases the running time.

K-means is a typical distance-based clustering algorithm that requires the number of clusters k to be fixed in advance; k is a natural number smaller than the number of samples. First, k samples are chosen at random as the initial cluster centers; the distance from each remaining sample to every cluster center is then computed and each sample is assigned to the nearest cluster. Processing all samples once constitutes one iteration. The mean of all samples in each cluster is then taken as the new cluster center and a new iteration begins. When the cluster centers stay the same between two successive iterations, or change by less than the threshold, the algorithm is considered to have converged and the corresponding clustering is regarded as the optimal clustering result. Limiting the maximum number of iterations is another way to make K-means terminate.
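A minimal sketch of steps (31) to (34) is given below. It assumes Euclidean distance, a fixed random seed for reproducibility, and a rule (not stated in the text) that an empty cluster keeps its previous center; the function name `kmeans` and its default values are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple


def kmeans(data: Sequence[Sequence[float]], k: int, tol: float = 1e-4,
           max_iter: int = 100, seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    rng = random.Random(seed)
    # Step (31): pick k samples as the initial cluster centers.
    centers = [list(x) for x in rng.sample(list(data), k)]
    assign = [0] * len(data)
    for _ in range(max_iter):                 # stop at the maximum iteration count
        # Step (32): assign every sample to its nearest center.
        assign = [min(range(k), key=lambda c: math.dist(x, centers[c])) for x in data]
        # Step (33): recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [x for x, a in zip(data, assign) if a == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # assumption: keep an empty cluster's center
        # Step (34): converged when no center moved by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return assign, centers


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assign, centers = kmeans(data, k=2)
print(assign)
```

Both stopping rules from step (34) appear here: `tol` bounds the movement of the centers and `max_iter` bounds the number of iterations.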

In step four of the present invention, oversampling the faulty hard disk samples with the SMOTE algorithm comprises the following steps:

Step (41): count the number T of faulty hard disk samples in the original training set generated in step three, and set the oversampling ratio N and the number of nearest neighbors m. Let N₁ = FLOOR(N/100) and N₂ = N % 100, where FLOOR rounds down to an integer and % is the remainder operation.

Step (42): initialize the set S of faulty hard disk samples to be sampled as empty. First add all faulty hard disk samples to S, repeating this N₁ times; then let T' = (N₂/100) * T and randomly select T' faulty hard disk samples to add to S.

Step (43): for each faulty hard disk sample p in S, find the m nearest faulty hard disk samples of p in the training set, randomly select one of these neighbors, q, and generate one synthetic sample from p as follows: compute the difference between the feature vector of q and the feature vector of p, multiply it by a random number between 0 and 1, and add the result to the feature vector of p to obtain the feature vector of the new synthetic sample.

SMOTE improves the balance of the training set by constructing synthetic minority-class samples. For each minority-class sample, one of its m nearest minority-class neighbors is selected at random (m is a positive integer smaller than the total number of minority-class samples) and a synthetic sample is generated whose feature vector lies at a random point on the line segment joining the feature vectors of the current sample and the selected neighbor. The number of synthetic samples is determined by the oversampling ratio N, which can be any positive integer; for example, N = 200 means that 2 synthetic samples are generated for each minority-class sample. Generating synthetic samples at random in this way improves the balance of the training set while adding as much information about the minority class as possible, thereby enlarging the decision region of the minority class in the classifier.
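Steps (41) to (43) might be sketched as follows. Two points are assumptions rather than statements from the text: T' is rounded down to an integer, and a sample is excluded from its own neighbor list. The function name `smote` and the toy data are illustrative.

```python
import math
import random
from typing import List, Sequence


def smote(minority: Sequence[Sequence[float]], n_percent: int, m: int,
          seed: int = 0) -> List[List[float]]:
    """Generate synthetic minority samples; n_percent is the oversampling ratio N."""
    rng = random.Random(seed)
    t = len(minority)                                  # step (41): T
    n1, n2 = n_percent // 100, n_percent % 100         # step (41): N1 and N2
    # Step (42): every sample N1 times, plus T' = floor(N2 * T / 100) random extras.
    picks = [i for _ in range(n1) for i in range(t)]
    picks += rng.sample(range(t), (n2 * t) // 100)
    synthetic = []
    for i in picks:                                    # step (43)
        p = minority[i]
        others = [minority[j] for j in range(t) if j != i]
        neighbors = sorted(others, key=lambda x: math.dist(p, x))[:m]
        q = rng.choice(neighbors)
        gap = rng.random()                             # random number in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic


faulty = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_samples = smote(faulty, n_percent=250, m=2)
print(len(new_samples))  # 2 * 4 + floor(50 * 4 / 100) = 10 synthetic samples
```

Each synthetic point lies on the segment between a faulty sample and one of its nearest faulty neighbors, so for the toy data above all generated coordinates stay within the unit square.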

In step five of the present invention, models are trained on the balanced training sets obtained after sampling using the LIBSVM tool. LIBSVM is a simple, easy-to-use, fast and effective software package for support vector machine pattern recognition and regression, developed by Professor Chih-Jen Lin and colleagues at National Taiwan University. It is used as follows: first prepare the data set in the format required by LIBSVM and apply simple scaling to the data; then select the radial basis function kernel and use cross-validation to choose the best parameters C and g, where C is the penalty parameter and g is the kernel parameter; finally train on the whole training set with the best C and g to obtain the support vector machine model, and use the model for testing and prediction. The radial basis function kernel is chosen because it handles well the case in which the relation between class labels and attributes is nonlinear. Cross-validation is used to obtain a stable and reliable model: in each round, most of the given samples are used for training and a small part is held out for testing, until every sample has been tested exactly once, and the resulting prediction error sum of squares is used as the evaluation metric. The k balanced training sets obtained in step four are used as input data sets and, with the radial basis function kernel, k sub-classifiers are trained following the steps above for predicting faulty hard disks.
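Two ingredients of this procedure can be illustrated without LIBSVM itself: the radial basis function kernel with parameter g, and a cross-validation split in which every sample is tested exactly once. The sketch below is an assumption-laden illustration: the function names are hypothetical, the interleaved fold assignment is one of several valid choices, and the power-of-two grid of candidate (C, g) values is a commonly used search range rather than one given in the text.

```python
import math
from typing import List, Sequence, Tuple


def rbf_kernel(x: Sequence[float], y: Sequence[float], g: float) -> float:
    """Radial basis function kernel K(x, y) = exp(-g * ||x - y||^2)."""
    return math.exp(-g * sum((a - b) ** 2 for a, b in zip(x, y)))


def k_fold_indices(n: int, folds: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists; each index appears in exactly one test fold."""
    splits = []
    for f in range(folds):
        test = list(range(f, n, folds))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        splits.append((train, test))
    return splits


# Hypothetical candidate grid for the penalty parameter C and kernel parameter g.
grid = [(2.0 ** c, 2.0 ** g) for c in range(-5, 16, 4) for g in range(-15, 4, 4)]

for train, test in k_fold_indices(10, 3):
    print(len(train), test)
```

A parameter search would train one model per (C, g) pair on each `train` split, score it on the matching `test` split, and keep the pair with the best average score before retraining on the whole training set.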

Beneficial effects: compared with existing methods, the hard disk failure prediction method of the present invention is closer to users' practical application scenarios, has high failure recall, and has good overall performance.

Brief description of the drawings

The present invention is further described below with reference to the drawings and specific embodiments, from which the above and other advantages of the invention will become more apparent.

Fig. 1 is the main flowchart of the present invention.

Fig. 2 is the main timing diagram of the present invention.

Fig. 3 shows the distribution of faulty hard disks in the data set used by the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and embodiments.

The present invention discloses a hard disk failure prediction method for a cloud computing platform. First, according to the hard disk maintenance records within the prediction time window, hard disk SMART log data are labeled as normal hard disk samples or faulty hard disk samples. The normal hard disk samples, after noise removal, are then divided into k disjoint subsets with the K-means clustering algorithm, each subset is combined with the faulty hard disk samples, and k balanced training sets are generated with the SMOTE oversampling algorithm; k support vector machine classifiers are trained on these sets for predicting faulty hard disks. In the prediction stage, the test set is first clustered with the DBSCAN clustering algorithm; samples inside clusters are predicted as normal hard disk samples, while noise samples are predicted by each of the trained classifiers and the final prediction is obtained by voting.

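Specifically, as shown in Fig. 1, the present invention comprises the following steps:

Step one: according to the hard disk maintenance records, label the SMART log data of hard disks that fail within the failure prediction time window as faulty hard disk samples, and label the SMART log data of hard disks that do not fail as normal hard disk samples.

Step two: cluster the normal hard disk samples with the DBSCAN algorithm, remove the noise samples that fall outside the clusters, and retain the normal hard disk samples that are grouped into clusters.

Step three: cluster the denoised normal hard disk samples with the K-means algorithm, dividing them into k disjoint subsets, and merge each subset with the faulty hard disk samples to form k original training sets.

Step four: oversample the faulty hard disk samples in each original training set with the SMOTE algorithm so that the numbers of faulty and normal hard disk samples in the training set are equal, thereby obtaining k balanced training sets.

Step five: using the LIBSVM tool with a radial basis function kernel, train a support vector machine model on each of the k balanced training sets, obtaining k support vector machine sub-classifiers.

Step six: cluster the test sample set with the DBSCAN algorithm, predict the samples that are grouped into clusters as normal hard disk samples, and retain the noise samples that fall outside the clusters.

Step seven: predict each remaining noise sample with the k support vector machine classifiers and determine the classification result by voting.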

The data used by the present invention were collected from a cloud computing cluster providing online services at an actual cloud computing provider. The cluster contains 4299 nodes with a total of 51703 hard disks of different vendors and models.

Hard disk SMART information is collected by scripts deployed on each node of the cluster and stored as SMART logs in a fixed "key: value" pair format. A lightweight daemon deployed on each physical node collects the various fixed-format local logs and stores them centrally in a distributed database. The SMART log table in the database is exported as a SMART data file in CSV format, which is the raw experimental data file of the present invention. SMART information was collected from every node of the target cluster at a fixed time each day for 66 days in total, from March 28, 2013 to June 1, 2013. The collected data cover the parameter values of all 24 SMART attributes provided by the hard disks in the cluster together with 8 other items such as vendor and model, for a total of 224 dimensions per hard disk.
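Reading one such "key: value" log record into a dictionary could look like the sketch below. The exact log layout is not given in the text, so the one-pair-per-line format and the key names (`vendor`, `model`, `reallocated_sector_ct_raw`) are hypothetical; only the vendor and model strings are taken from the description.

```python
from typing import Dict


def parse_smart_log(text: str) -> Dict[str, str]:
    """Parse lines of the form 'key: value' into a dictionary, skipping other lines."""
    record: Dict[str, str] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue                          # ignore lines without a key/value pair
        key, value = line.split(":", 1)       # split on the first colon only
        record[key.strip()] = value.strip()
    return record


# Hypothetical log excerpt for one hard disk.
sample = """vendor: Seagate Constellation ES (SATA)
model: ST32000644NS
reallocated_sector_ct_raw: 0
"""
record = parse_smart_log(sample)
print(record["model"], len(record))
```

Records parsed this way can be written out row by row with the standard `csv` module to obtain the CSV data file described above.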

Taking into account both the time needed for data recovery and the possible timeliness of SMART information, the prediction time window for faulty hard disks in step one is set to 24 hours. When a disk is predicted to be about to fail, this leaves enough time for data recovery; and because it cannot be confirmed how long SMART information remains indicative, it also ensures as far as possible that the model's prediction performance is not degraded by any limited timeliness of the SMART data.

The present invention divides disk states into two classes: "normal" and "about to fail within 24 hours". A hard disk failure is defined as follows: when a hard disk is confirmed as needing replacement, it is considered to have failed, and the failure time is the time at which the need for repair or replacement was confirmed. The timing relationships are shown in Fig. 2.
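The labeling rule can be expressed as a small function. This is a sketch under stated assumptions: the function name and the example timestamps are hypothetical, and treating both ends of the 24-hour window as inclusive is a choice made here, not something the text specifies.

```python
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(hours=24)  # failure prediction time window


def label_sample(observed_at: datetime, confirmed_failure_at: Optional[datetime],
                 window: timedelta = WINDOW) -> int:
    """Return 1 ("about to fail") if replacement is confirmed within the window, else 0."""
    if confirmed_failure_at is None:          # no maintenance record for this disk
        return 0
    delta = confirmed_failure_at - observed_at
    return 1 if timedelta(0) <= delta <= window else 0


# Hypothetical timestamps: SMART data read on March 31, replacement confirmed on April 1.
print(label_sample(datetime(2013, 3, 31, 12, 0), datetime(2013, 4, 1, 9, 0)))
print(label_sample(datetime(2013, 3, 31, 12, 0), None))
```

A disk whose replacement is confirmed later than 24 hours after the observation is labeled normal for that observation, since the failure falls outside the prediction window.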

Hard disk maintenance information comes from a separate database, the hard disk maintenance record database, which is updated daily. Between March 28, 2013 and June 2, 2013, the target cluster had 362 records of hard disks confirmed as needing replacement. Because the detailed information in the maintenance records is itself incomplete, 240 of the 362 records could not be matched to a specific hard disk, leaving only 122 valid records.

The formulas or thresholds used to compute SMART attribute values may differ between vendors and models. To eliminate the influence of vendor and model, the present invention takes as its research object the hard disks of the single vendor and model that is most numerous in the target cluster. The selected vendor is "Seagate Constellation ES (SATA)" and the model is "ST32000644NS". The distribution of the number of disks of this model confirmed for repair or replacement over the 67-day period from March 28, 2013 to June 2, 2013 is shown in Fig. 3.

According to Fig. 3, during the 67-day period, failures occurred on only 28 days, affecting 45 hard disks of model ST32000644NS in total. The day with the most failures was April 1, 2013, when 4 hard disks of model ST32000644NS failed. Based on the distribution of failures, and considering the need to guarantee a certain absolute number of faulty hard disk samples, the present invention builds the training sample set from the hard disk SMART information of March 31, 2013 and builds two test sample sets from the data of March 28, 2013 and April 17, 2013.

The experiments of the present invention are carried out on the Waikato Environment for Knowledge Analysis (WEKA) platform. WEKA is free, open-source, Java-based software for machine learning and data mining, developed by Ian H. Witten, Eibe Frank, and others. The WEKA platform integrates a large number of machine learning algorithms for tasks such as data preprocessing, classification and regression, clustering, and association rules, and is one of the most complete data mining tools available today.

Before a model is trained on the training set, the data must be preprocessed, mainly by feature selection, data cleaning, and data transformation. Feature selection removes features that carry no useful information, reducing storage and computation costs. Data cleaning deletes records in which too many attribute values are missing and fills in the remaining missing values. Data transformation mainly normalizes, standardizes, and discretizes the data set. After feature selection, data cleaning, and data transformation, the final standard experimental data sets are obtained: the training set, test set 0328, and test set 0417, as shown in Table 1.
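Two of the preprocessing operations mentioned above, filling missing values and normalizing, can be sketched as follows. The text does not say which fill rule or which normalization was used; column-mean filling and min-max scaling to [0, 1] are assumptions made for this illustration, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence


def fill_missing(rows: Sequence[Sequence[Optional[float]]]) -> List[List[float]]:
    """Replace None in each column with that column's mean (0.0 if the column is empty)."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column to [0, 1]; constant columns are mapped to 0.0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
            for row in rows]


raw = [[1.0, None], [3.0, 4.0]]
clean = fill_missing(raw)
print(min_max_scale(clean))
```

Scaling matters for the later steps because DBSCAN, K-means, SMOTE and the radial basis function kernel all rely on distances between feature vectors.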

Table 1. Characteristics of the preprocessed data sets

The present invention studies anomaly detection on an extremely imbalanced data set. If anomaly detection is applied directly to the experimental data with ordinary methods, the fault samples are usually swamped by the normal samples and are difficult to detect. The number of normal hard disk samples in the training data set must therefore be reduced to lower the imbalance ratio of the data set.

Step two clusters the normal hard disk samples in the training set with DBSCAN, deletes the noise samples, and merges the samples that are grouped into clusters with the faulty hard disk samples in the training set to form a new training set. This step rests on the following observation: normal samples usually lie in denser regions, are relatively close to one another, and can be grouped into clusters, whereas the distances between fault samples, and between fault samples and normal samples, are generally larger, so fault samples cannot be grouped into clusters.

Deleting the noise samples produced by DBSCAN clustering reduces the number of normal samples and so lowers the imbalance ratio of the new data set. In addition, noise samples lie far from the clustered normal samples, and one possibility is that they lie closer to the fault samples; when a support vector machine is used for classification, such samples push the optimal separating hyperplane toward the fault samples. Deleting them moves the optimal separating hyperplane toward the clustered normal samples, which improves the detection rate for fault samples. In other words, when applying DBSCAN clustering to the training set, the present invention aims both to undersample the normal hard disk samples and to increase the difference between normal and faulty hard disk samples.

DBSCAN clustering performance depends strongly on two parameters, the radius Eps and the minimum number of data points Minpts, which can usually only be chosen empirically. The present invention first runs experiments on the training sample set containing all normal and faulty hard disk samples to find suitable parameters. The results show that with Eps = 1 and Minpts = 4, the number of normal hard disk samples grouped into clusters is an order of magnitude smaller than the original number of normal hard disk samples; with Eps = 1 and Minpts = 2, it is about half the original number. In both cases all faulty hard disk samples fall into the noise data. Similar results are obtained with the same parameter values on test set 0328.

In step two, the optimal parameters Eps = 1 and Minpts = 4 are chosen and DBSCAN is used to cluster the normal hard disk samples in the training set. The 375 normal hard disk samples grouped into clusters are merged with the 4 faulty hard disk samples of the original training set to form a new training data set.

After the normal hard disk samples have been clustered and undersampled with DBSCAN, the imbalance ratio of the data set is 375:4. Although this is far lower than before, detecting faulty disks on such a data set is still an anomaly detection problem on an extremely imbalanced data set. Steps three and four therefore partition the normal hard disk samples with the K-means clustering algorithm and then oversample the faulty hard disk samples to balance the data set further. The number of clusters k is set to 7, and K-means groups the undersampled normal hard disk samples into 7 clusters; the distribution of samples after clustering is shown in Table 2.

Table 2. K-means clustering results after undersampling the training set

The normal samples are divided according to their clusters and each part is merged with the faulty hard disk samples, giving 7 sample sets. The imbalance ratio in these sample sets is at most 81:4 and at least 3:1. Each of the 7 sample sets has two class labels, Clusteri and 1, where i = 0, 1, ..., 6.
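The construction of the per-cluster sample sets can be sketched as below. The function name, the dictionary layout and the toy data are hypothetical; the cluster assignments would come from the K-means step.

```python
from typing import Dict, List, Sequence


def build_training_sets(normal: Sequence[Sequence[float]], cluster_ids: Sequence[int],
                        faulty: Sequence[Sequence[float]], k: int) -> List[Dict]:
    """Pair the normal samples of each cluster with all faulty samples."""
    sets = []
    for c in range(k):
        members = [x for x, cid in zip(normal, cluster_ids) if cid == c]
        sets.append({
            "normal": members,                     # labeled Clusteri in the text
            "faulty": list(faulty),                # labeled 1 in the text
            "ratio": len(members) / len(faulty),   # imbalance ratio normal:faulty
        })
    return sets


# Toy data: 5 normal samples in 2 clusters and 2 faulty samples.
normal = [[0.0], [1.0], [2.0], [3.0], [4.0]]
ids = [0, 0, 1, 1, 1]
faulty = [[9.0], [8.0]]
for s in build_training_sets(normal, ids, faulty, k=2):
    print(len(s["normal"]), len(s["faulty"]), s["ratio"])
```

Because every set reuses the same faulty samples, each sub-classifier sees all known failures but only one region of the normal samples.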

Step four oversamples the faulty hard disk samples in each of the above 7 sample sets with the SMOTE algorithm. The nearest-neighbor parameter m is set to 1, and the sampling ratio is the ratio of the number of normal hard disk samples to the number of faulty hard disk samples in each sample set. Oversampling the faulty hard disk samples yields 7 balanced sample sets, which are the input training sets for training the sub-classifiers in the next step.

Step five uses the LIBSVM algorithm integrated in the WEKA platform, selecting the C-support vector machine (C-SVM) with a radial basis function kernel and leaving the other parameters at their defaults. Using cross-validation and taking each of the 7 balanced training sets as input, 7 sub-classifiers are trained. Each sub-classifier classifies a data sample as Clusteri or 1, where i = 0, 1, ..., 6. A prediction of 1 means that the sample is predicted to be a faulty hard disk sample; any other label means that the sample is predicted to be a normal sample, and the predicted label is changed to 0.

The predictions of the sub-classifiers are combined by voting to determine the final prediction. A voting threshold n (n ≤ 7) is set: when more than n of the 7 sub-classifiers predict that a sample is a faulty hard disk sample, the sample is finally predicted to be a faulty hard disk sample; otherwise it is predicted to be a normal hard disk sample. The choice of n affects the final prediction performance; Table 3 shows the prediction performance on the training set for different thresholds.
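The label mapping and the vote can be written out as a short sketch. It follows the wording of the text literally, so a sample is predicted faulty when more than n sub-classifiers vote for it; a reader who interprets the threshold as "at least n" would replace `>` with `>=`. The function names are illustrative.

```python
from typing import List, Sequence


def to_binary(label: str) -> int:
    """Map a sub-classifier label to 1 (faulty) or 0 (any Clusteri label, i.e. normal)."""
    return 1 if label == "1" else 0


def ensemble_vote(labels: Sequence[str], n: int) -> int:
    """Predict faulty (1) when more than n sub-classifiers output the faulty label."""
    votes = sum(to_binary(lab) for lab in labels)
    return 1 if votes > n else 0


# Outputs of 7 hypothetical sub-classifiers for one sample.
outputs: List[str] = ["1", "1", "1", "1", "1", "Cluster5", "Cluster6"]
print(ensemble_vote(outputs, n=4))
print(ensemble_vote(outputs, n=5))
```

A lower threshold favors the detection rate and a higher one favors a low false alarm rate, which is why the threshold is treated as a tunable parameter.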

According to Table 3, two sub-classifiers do not predict all samples of the training set correctly, and the model achieves perfect performance on the training set when the voting threshold is less than or equal to 5. Here TP is the number of faulty hard disks correctly predicted as faulty, FN is the number of faulty hard disks wrongly predicted as normal, FP is the number of normal hard disks wrongly predicted as faulty, and TN is the number of normal hard disks correctly predicted as normal. TPR and FPR denote the detection rate and the false alarm rate respectively, and are computed as:

TPR = \frac{TP}{TP + FN} \qquad (1)

FPR = \frac{FP}{TN + FP} \qquad (2)

The geometric mean (G-mean) measures the overall performance of the classifier. G-mean takes a high value only when the detection rates of both faulty hard disks and normal hard disks are high. It is computed as:

G\text{-}mean = \sqrt{TPR \cdot TNR} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}} \qquad (3)
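For reference, the three measures can be computed directly from the four confusion-matrix counts. The function name `rates` is hypothetical, and the sketch takes FPR as the share of normal disks flagged as faulty, FP / (FP + TN), so that TNR = 1 - FPR.

```python
import math


def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, G-mean) for one confusion matrix."""
    tpr = tp / (tp + fn)            # detection rate of faulty disks
    fpr = fp / (fp + tn)            # false alarm rate
    tnr = tn / (tn + fp)            # detection rate of normal disks
    g_mean = math.sqrt(tpr * tnr)   # geometric mean of both rates
    return tpr, fpr, g_mean
```

With made-up counts of 4 true positives, 0 false negatives, 10 false positives and 90 true negatives, the detection rate is 1, the false alarm rate is 0.1, and G-mean is the square root of 0.9.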

Table 3: Prediction performance on the training set for different voting thresholds (k=7)

With the hybrid strategy described above, the invention achieves nearly perfect prediction performance on the training sample set. To verify how well the model generalizes, it is then applied to test set 0328 and test set 0417 to evaluate its prediction performance.

During training, the model reduces the imbalance ratio of the training sample set through two rounds of clustering. The separating hyperplane that the support vector machine learns on the processed sample set is therefore necessarily biased toward the normal hard disk samples, which yields a fault detection rate of 1 on the training set. However, because the separating hyperplane moves away from the faulty hard disk samples and toward the normal samples, applying the model directly to the test sample set would inevitably cause a high false alarm rate. One possible way to reduce the false alarm rate while keeping the detection rate high is to preprocess the test sample set before applying the model: the data samples that are "most likely normal" are separated from the test set and do not take part in prediction. Following the hybrid strategy proposed by the invention, DBSCAN clustering is first applied to the test sample set; samples that fall into clusters are predicted to be normal hard disk samples, are separated from the original test sample set, and do not take part in the subsequent prediction by the trained model.

Step 6: Test set 0328 and test set 0417 are each clustered with the parameter combination Eps=1, Minpts=2. Samples that fall into clusters are predicted to be normal samples, and the noise samples are separated out to form new test sets. The new test set 0328 contains 3774 samples and the new test set 0417 contains 3690 samples; the trained classification model is then used to predict the new test sets.

Step 7: The 7 sub-classifiers obtained in training are applied to the new test sets, and their classification results are combined by voting. Table 4 and Table 5 show the prediction performance on the two test sets for different values of the voting threshold.

On test set 0328, faulty hard disks are predicted when the voting threshold satisfies n≤4; on test set 0417, faulty hard disks are predicted when n≤5. When n=3, the detection rate is 100% on both test set 0328 and test set 0417, and the overall performance measure G-mean reaches its best value on both.

Table 4: Ensemble prediction performance of the 7 sub-classifiers on test set 0328 (k=7)

Table 5: Ensemble prediction performance of the 7 sub-classifiers on test set 0417 (k=7)

As Table 3 shows, two sub-classifiers cannot correctly predict all fault samples of the training set in the model training stage. When these two misjudging sub-classifiers are excluded from the ensemble classifier and only the remaining 5 sub-classifiers take part in the vote, the prediction performance on the two test sets is as shown in Table 6 and Table 7.

Table 6: Ensemble prediction performance of the 5 sub-classifiers on test set 0328 (k=7)

Table 7: Ensemble prediction performance of the 5 sub-classifiers on test set 0417 (k=7)

With the voting threshold n=3, test set 0328 still reaches the best overall performance G-mean: the detection rate stays at 100% and the false alarm rate falls by 10.5%. On test set 0417 with n=3, the false alarm rate falls by 14.2%, but only 1 faulty hard disk is predicted, so the overall performance also declines.

The invention provides an approach to hard disk failure prediction for a cloud computing platform. There are many methods and ways to implement this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention. Any component not explicitly specified in this embodiment can be implemented with existing technology.

Claims

1. A hard disk failure prediction method for a cloud computing platform, characterized in that it comprises the following steps:

wherein, based on the SMART observations of a hard disk at any given moment, it is predicted whether the hard disk will fail within a period of time starting from that moment, and this period is the hard disk failure prediction time window;

Step 2: the normal hard disk samples are clustered with the density-based spatial clustering of applications with noise (DBSCAN) algorithm; the noise samples outside the clusters are removed, and the normal hard disk samples that fall into clusters are retained;

Step 4: the faulty hard disk samples in each original training set are over-sampled with the synthetic minority over-sampling technique (SMOTE) algorithm, so that the number of faulty hard disk samples in the training set matches the number of normal hard disk samples, which yields k balanced training sets;

Step 6: the test sample set is clustered with the density-based spatial clustering of applications with noise (DBSCAN) algorithm; the samples that fall into clusters are deleted, the noise samples outside the clusters are retained, and the deleted samples are predicted to be normal hard disk samples;

Step 7: the remaining noise samples are each predicted by the k support vector machine sub-classifiers obtained in the training stage, and the classification result is determined by voting: if the number of votes judging a test sample to be a faulty hard disk sample exceeds the set threshold, the sample is predicted to be faulty; otherwise it is predicted to be normal.
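The prediction stage of steps 6 and 7 amounts to a two-way routing of test samples, which the following sketch illustrates. It assumes that the DBSCAN result is already available as a list of noise flags and that each sub-classifier is a callable returning 1 for faulty and 0 for normal; `predict_test_set` and both of these representations are hypothetical.

```python
def predict_test_set(samples, is_noise, sub_classifiers, threshold):
    """Predict 1 (faulty) or 0 (normal) for every test sample.

    Samples that fell into a cluster (is_noise is False) are predicted
    normal without consulting any classifier. Noise samples are passed
    to all sub-classifiers, and the votes are compared with threshold.
    """
    results = []
    for sample, noise in zip(samples, is_noise):
        if not noise:
            results.append(0)  # clustered sample: predicted normal
            continue
        votes = sum(1 for clf in sub_classifiers if clf(sample) == 1)
        results.append(1 if votes > threshold else 0)
    return results
```

Because clustered samples bypass the classifiers entirely, only the noise samples can contribute false alarms.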

2. The hard disk failure prediction method for a cloud computing platform according to claim 1, characterized in that, in step 2, clustering the normal hard disk samples with the density-based spatial clustering of applications with noise (DBSCAN) algorithm comprises the following steps:

Step (21): select any unvisited sample p from the normal hard disk sample set and check the number of sample objects in the neighborhood of p with radius Eps; if this number is greater than or equal to the set minimum number of samples Minpts, create a new cluster C and add sample p and all sample objects in its neighborhood of radius Eps to cluster C; if the number is less than Minpts, mark sample p as a noise sample;

Step (22): select any unvisited sample q in cluster C and check its neighborhood of radius Eps; if the number of sample objects in the neighborhood is greater than or equal to the set minimum number of samples Minpts, add sample q and the samples in its neighborhood to cluster C;

Step (23): repeat step (22) until all sample objects in cluster C have been visited;

Step (24): repeat steps (21) to (23) until all sample objects in the normal hard disk sample set have been visited and each has either been added to a cluster or been marked as noise;

where Eps denotes the radius and takes a positive real value, and Minpts denotes the minimum number of samples and takes a natural number smaller than the number of samples.
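Steps (21) to (24) can be sketched as follows. The sketch makes two assumptions that the claim does not spell out: the Eps-neighborhood of a sample includes the sample itself, and a sample first marked as noise may later join a cluster as a border sample, as in the standard DBSCAN formulation. The names `region_query`, `dbscan` and `NOISE` are hypothetical.

```python
import math

NOISE = -1  # label for samples outside every cluster


def region_query(samples, idx, eps):
    """Indices of all samples within distance eps of samples[idx],
    including idx itself (Euclidean distance)."""
    return [j for j, s in enumerate(samples)
            if math.dist(samples[idx], s) <= eps]


def dbscan(samples, eps, min_pts):
    """Return one label per sample: a cluster index >= 0, or NOISE."""
    labels = [None] * len(samples)  # None means not visited yet
    cluster_id = -1
    for p in range(len(samples)):   # step (21): pick an unvisited sample
        if labels[p] is not None:
            continue
        neighbours = region_query(samples, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = NOISE
            continue
        cluster_id += 1             # p starts a new cluster C
        labels[p] = cluster_id
        seeds = list(neighbours)
        i = 0
        while i < len(seeds):       # steps (22)-(23): expand cluster C
            q = seeds[i]
            i += 1
            if labels[q] == NOISE:  # border sample reclaimed by C
                labels[q] = cluster_id
                continue
            if labels[q] is not None:
                continue            # already assigned to a cluster
            labels[q] = cluster_id
            q_neighbours = region_query(samples, q, eps)
            if len(q_neighbours) >= min_pts:
                seeds.extend(q_neighbours)
    return labels                   # step (24): every sample is labelled
```

In the training stage the samples labelled `NOISE` are discarded; in the prediction stage they are the only samples passed on to the sub-classifiers.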

3. The hard disk failure prediction method for a cloud computing platform according to claim 2, characterized in that, in step 3, clustering the denoised normal hard disk samples with the K-means algorithm comprises the following steps:

Step (34): repeat steps (32) to (33) until the convergence condition is met.

4. The hard disk failure prediction method for a cloud computing platform according to claim 1, characterized in that, in step 4, over-sampling the faulty hard disk samples with the synthetic minority over-sampling technique (SMOTE) algorithm comprises the following steps:

Step (41): count the number T of faulty hard disk samples in the training set, and set the over-sampling ratio N and the number of nearest neighbors m; let N₁ = FLOOR(N/100) and N₂ = N % 100, where FLOOR is the round-down function and % is the remainder operation;

Step (42): initialize the set S of faulty hard disk samples to be sampled as empty; first add all faulty hard disk samples to S, repeating this N₁ times; then let T′ = (N₂/100) × T and randomly select T′ samples from the faulty hard disk samples to add to S;
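The arithmetic of steps (41) and (42) can be illustrated with a short sketch. Rounding T′ down to an integer and drawing the T′ samples without replacement are assumptions of the sketch, as are the function name `build_sampling_set` and the fixed random seed.

```python
import random


def build_sampling_set(faulty, N, seed=0):
    """Build the set S of faulty samples to be over-sampled.

    N is the over-sampling ratio in percent: every faulty sample is
    added N // 100 times, then a random share of (N % 100) percent of
    the faulty samples is added once more.
    """
    rng = random.Random(seed)
    T = len(faulty)
    n1 = N // 100                 # N1 = FLOOR(N / 100)
    n2 = N % 100                  # N2 = N % 100
    S = []
    for _ in range(n1):
        S.extend(faulty)
    t_prime = (n2 * T) // 100     # T' = (N2 / 100) * T, rounded down
    S.extend(rng.sample(faulty, t_prime))
    return S
```

For example, with T = 4 faulty samples and N = 250, every sample enters S twice and 2 further samples are drawn at random, so S holds 10 entries.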

5. The hard disk failure prediction method for a cloud computing platform according to claim 4, characterized in that, in step 5, the LIBSVM tool is used to train models on the balanced training sets obtained after sampling, with the following steps: first prepare the data set in the format required by the LIBSVM software package and apply simple scaling to the data; then select the radial basis kernel function and use cross-validation to choose the best parameters C and g, where C is the penalty parameter and g is the kernel parameter; finally train on the whole training set with the best parameters C and g to obtain a support vector machine model, and use the resulting model for testing and prediction. The k balanced training sets obtained in step 4 are used as input data sets, and with the radial basis kernel, training according to the above steps yields k support vector machine sub-classifiers for predicting faulty hard disks.