CN112465153A - Disk fault prediction method based on unbalanced integrated binary classification

Info

Publication number: CN112465153A
Application number: CN202011510541.0A
Authority: CN
Inventors: 高欣; 任昺; 何杨; 李康生; 井潇; 纪维佳; 查森; 王�锋
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-12-23
Filing date: 2020-12-18
Publication date: 2021-03-09
Also published as: CN111091201A

Abstract

The invention discloses a disk failure prediction method based on unbalanced integrated binary classification, which comprises the following steps: sampling SMART data of a disk, selecting state features related to disk failures as an original data set, and obtaining a balanced data set through data-partition hybrid sampling; inputting the original data set and the balanced data set of the disk into a random forest (RF) algorithm for machine learning, training an original model biased toward the majority class and a local region strengthening and weakening model respectively, and integrating the two models to obtain a hybrid model biased toward the peripheral boundary; and adaptively selecting among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, and using the resulting classification probability to predict the disk failure state. The method can effectively address the difficulty of predicting disk failures when the numbers of normal and abnormal samples are unbalanced, and improves machine-learning-based disk failure prediction capability.

Description

Disk fault prediction method based on unbalanced integrated binary classification

[ technical field ]

The invention relates to the technical field of information storage, and in particular to a disk failure prediction method based on unbalanced integrated binary classification.

[ background of the invention ]

With the continuous development of the information industry, large volumes of paper records have been digitized and electronic data is generated continuously, so data storage services have grown rapidly. The scale of disk deployment in storage systems is extremely large, and disk stability bears on the safety and reliability of the whole storage system in a data center. The disk is the hardware component with the highest failure rate; once it operates abnormally or loses data, services may be unrecoverable and the impact severe. If disk failures can be predicted in advance, operation and maintenance personnel can back up data and replace disks ahead of time, which greatly reduces risk and loss. At present, disk manufacturers all use SMART (Self-Monitoring, Analysis and Reporting Technology) to monitor disks, but the traditional threshold-based judgment method has too low a failure detection rate, and its early-warning effect in practice is poor. Machine-learning-based disk failure prediction obtains a satisfactory prediction effect through the strong learning capability of the model. Such methods mainly use unbalanced classification: a large amount of SMART data from healthy and failed disks is collected, and a classification model is trained after feature extraction. Many approaches have been proposed for the unbalanced classification problem, mainly data-level approaches, algorithm-level approaches, and approaches that combine data processing with algorithms. The approaches that combine data processing with algorithms perform better on unbalanced classification, but they do not fully consider the data distribution of the sample space, cannot improve classification performance by using different classifiers in different regions, select the model with a simple static strategy, and do not predict each test object individually, which reduces the applicability of the model.

[ summary of the invention ]

In view of this, the embodiment of the present invention provides a disk failure prediction method based on unbalanced integrated binary classification, which can effectively address the difficulty of predicting disk failures when the numbers of normal and abnormal samples are unbalanced, and which improves unbalanced classification performance by adjusting the data distribution to generate different classification models, thereby improving machine-learning-based disk failure prediction capability.

The embodiment of the invention provides a disk failure prediction method based on unbalanced integrated binary classification, which comprises the following steps:

sampling SMART data of a disk, selecting state features related to disk failures as an original data set, and obtaining a balanced data set through data-partition hybrid sampling;

inputting the original data set and the balanced data set of the disk into an RF algorithm for machine learning, training an original model biased toward the majority class and a local region strengthening and weakening model respectively, and integrating the two models to obtain a hybrid model biased toward the peripheral boundary;

and adaptively selecting among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, and using the resulting classification probability to predict the disk failure state.

In the method, sampling SMART data of a disk, selecting state features related to disk failures as an original data set, and obtaining a balanced data set through data-partition hybrid sampling comprises the following. Jump analysis and value-range analysis are performed on the disk data, features that are related to disk failures and have larger entropy values are selected, and the feature values are collected as the original data set $D$. The data labeled normal and faulty in $D$ are first divided into a majority class set $D_{maj}$ and a minority class set $D_{min}$. A boundary region $D_{border}$, a minority class noise region $D_{danger-}$, a minority class safe region $D_{safe-}$ and a majority class safe region $D_{safe+}$ are defined and initialized as empty sets, where + denotes majority class samples and - denotes minority class samples.

The minority class set $D_{min}$, which contains the minority class samples $x_i$, $i = 1, 2, ..., N_{min}$, where $N_{min}$ is the number of samples in the minority class set, is then traversed. The $k$ nearest neighbors of each minority class sample are found by the kNN algorithm, with $k = 13$, and the number $N_{i-}$ of minority class samples among the neighbors is counted. The majority class samples among the neighbors are stored in the boundary region $D_{border}$, and the class ratio in the neighborhood of each minority class sample is calculated as

$$\gamma_i = \frac{N_{i-}}{k}$$

If $\gamma_i = 0$, the minority class sample is added to the minority class noise region $D_{danger-}$; if $\gamma_i \in (0, 1)$, the minority class sample and the majority class samples in its neighborhood are added to the boundary region $D_{border}$; if $\gamma_i = 1$, the minority class sample is added to the minority class safe region $D_{safe-}$. The remaining samples of the training set $D$ are added to the majority class safe region $D_{safe+}$, and the minority class noise region $D_{danger-}$ is removed from $D$ to obtain the filtered set $D_{filter}$.

The number of samples $N_{border}$ in the boundary region is counted; it comprises $m$ minority class samples and $n$ majority class samples. The minority class samples in the boundary region are denoted $x_{border_i}$, $i = 1, 2, ..., m$, and $N_{i+}$ is the number of majority class samples in the neighborhood of each $x_{border_i}$. The number of samples to be synthesized in the boundary region is

$$G = (m + n) \times b - m, \quad b \in [0.5, 1]$$

where $b$ is the synthesis scale factor. When $b = 1$, $G = n$, so the number of synthesized minority class samples is balanced with the number of majority class samples, and the minority class count after synthesis equals the original total number of samples. For each boundary-region minority class sample $x_{border_i}$, the proportion of majority class samples among its $k$ neighbors is calculated and recorded as

$$r_i = \frac{N_{i+}}{k}$$

A weight is generated from the ratio of each minority class sample's majority proportion to the sum of these proportions:

$$w_i = \frac{r_i}{\sum_{j=1}^{m} r_j}$$

The synthesis count for each minority class sample of the boundary region is $g_i = w_i \times G$, $i = 1, 2, ..., m$. By the SMOTE method, $g_i$ minority class samples are synthesized around each minority class sample of the boundary region, and the synthesized samples are added to $D_{border}$ to obtain the boundary-region oversampled set $D_{border}^{over}$.

The majority class safe region $D_{safe+}$ is clustered into a number of clusters determined by $N_{safe+}$, where $N_{safe+}$ is the number of samples in $D_{safe+}$. Each cluster is randomly undersampled to half of its sample count, which yields the majority class safe region undersampled set $D_{safe+}^{under}$. The minority class noise region $D_{danger-}$ is removed and the minority class safe region samples $D_{safe-}$ are retained, finally giving the balanced data set $D_{balance} = D_{safe-} + D_{safe+}^{under} + D_{border}^{over}$.
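The allocation of synthesis counts in the boundary region can be illustrated with a short Python sketch. It only computes $G$, the weights $w_i$ and the counts $g_i$ from the quantities named above; the function name `synthesis_counts`, its argument names, and the rounding of $g_i$ to an integer are assumptions made for illustration and are not specified in the text. The SMOTE interpolation itself is not shown.

```python
def synthesis_counts(majority_ratios, n_majority_border, b=1.0):
    """Allocate synthesis counts g_i to boundary-region minority samples.

    majority_ratios: for each boundary minority sample, the fraction of its
        k neighbors that are majority class (N_i+ / k).
    n_majority_border: number n of majority samples in the boundary region.
    b: synthesis scale factor, expected in [0.5, 1].
    """
    m = len(majority_ratios)
    if m == 0:
        return []
    total = (m + n_majority_border) * b - m        # G = (m + n) * b - m
    ratio_sum = sum(majority_ratios)
    weights = [r / ratio_sum for r in majority_ratios]  # w_i
    # Rounding is an assumption; the text states only g_i = w_i * G.
    return [round(w * total) for w in weights]


# Three boundary minority samples, eight boundary majority samples, b = 1.
print(synthesis_counts([0.5, 0.25, 0.25], n_majority_border=8, b=1.0))  # [4, 2, 2]
```

Minority samples whose neighborhoods contain more majority samples receive a larger share of the synthesized samples, which concentrates oversampling where the two classes are most mixed.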

In the method, inputting the original data set and the balanced data set of the disk into the RF algorithm for machine learning, training an original model biased toward the majority class and a local region strengthening and weakening model respectively, and integrating the two models to obtain a hybrid model biased toward the peripheral boundary comprises the following. The original model biased toward the majority class, the local region strengthening and weakening model, and the hybrid model biased toward the peripheral boundary are trained with the random forest (RF) algorithm, whose main parameters are: number of decision trees n_estimators = 1000, decision tree splitting criterion criterion = 'gini', minimum number of samples required to split an internal node min_samples_split = 2, and minimum number of samples required at a leaf node min_samples_leaf = 1. The original model biased toward the majority class, $RF_1$, is obtained by training on the unbalanced training set $D$, with forest size $s = 100$. The local region strengthening and weakening model, $RF_2$, is obtained by training on the balanced data set $D_{balance}$, with forest size $s = 100$. All base classifiers of $RF_1$ and $RF_2$ are integrated to obtain the hybrid model $RF$ biased toward the peripheral boundary, with forest size $s = 200$ and model ratio $q = 0.5$.

In the method, adaptively selecting among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, and using the resulting classification probability to predict the disk failure state comprises the following. Given a test set $T$, the nearest neighbors of each test point in the original disk data are found by the kNN algorithm, the number $T_{i+}$ of majority class samples among the neighbors is counted, and the neighbor imbalance degree of the test point is calculated. If the imbalance degree is 1, the test point is classed as surrounded entirely by majority class samples, and the original model biased toward the majority class, $RF_1$, is selected for prediction. If the test point is classed as surrounded mostly by majority class samples, the hybrid model $RF$ biased toward the peripheral boundary is selected for prediction. If the test point is classed as surrounded by few majority class samples, the local region strengthening and weakening model $RF_2$ is selected for prediction. Finally, the decision tree results of all models are integrated, and the final classification result Label is obtained by hard voting:

$$Label(x) = \arg\max_{y} \sum_{t=1}^{s} I(h_t(x) = y)$$

where $I(\cdot)$ is the indicator function, $\arg\max_{y}$ gives the prediction class for which the vote count of the test sample is largest, $h_t(x)$ is the result of the $t$-th decision tree, $x$ is the test point, and $y$ takes one of two classes, the minority class 0 and the majority class 1. A result of 1 indicates that the disk is predicted to be normal with high probability, and 0 indicates that the disk is predicted to fail with high probability, so that the predicted disk state of the test sample is obtained.

[ description of the drawings ]

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a disk failure prediction method based on unbalanced integrated binary classification according to an embodiment of the present invention;

FIG. 2 is a framework flow chart of a disk failure prediction method based on unbalanced integrated binary classification according to an embodiment of the present invention.

[ detailed description of the embodiments ]

For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.

It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present invention provides a disk failure prediction method based on unbalanced integrated binary classification. Referring to FIG. 1, which is a schematic flow chart of the method according to the embodiment of the present invention, the method comprises the following steps:

Step 101, SMART data of a disk is sampled, state features related to disk failures are selected as an original data set, and a balanced data set is obtained through data-partition hybrid sampling.

Specifically, the SMART data set used comes from Backblaze. SMART is a disk self-monitoring, analysis and reporting technology and a set of standards established by disk manufacturers; all monitored and recorded data are called SMART data. Jump analysis and value-range analysis are performed on the SMART data of the disk, and features that are related to disk failures and have larger entropy values are selected. The selected disk SMART attributes are: raw read error rate, start/stop count, reallocated sector count, seek error rate, power-on hours, reported uncorrectable errors, command timeout, load/unload cycle count, temperature, current pending sector count, offline uncorrectable sector count, head flying hours / transfer error rate, total LBAs written, and total LBAs read. After the disk data are collected, the good/bad labels of the disks are obtained from the monitoring system, and the labels together with the 14 disk SMART attribute values are collected as the original data set $D$.

The data labeled normal and faulty in $D$ are first divided into a majority class set $D_{maj}$ and a minority class set $D_{min}$. A boundary region $D_{border}$, a minority class noise region $D_{danger-}$, a minority class safe region $D_{safe-}$ and a majority class safe region $D_{safe+}$ are defined and initialized as empty sets, where + denotes majority class samples and - denotes minority class samples. The minority class set $D_{min}$, which contains the minority class samples $x_i$, $i = 1, 2, ..., N_{min}$, where $N_{min}$ is the number of samples in the minority class set, is then traversed. The $k$ nearest neighbors of each minority class sample are found by the kNN algorithm, with $k = 13$, and the number $N_{i-}$ of minority class samples among the neighbors is counted. The majority class samples among the neighbors are stored in the boundary region $D_{border}$, and the class ratio in the neighborhood of each minority class sample is calculated as

$$\gamma_i = \frac{N_{i-}}{k}$$

If $\gamma_i = 0$, the minority class sample is added to the minority class noise region $D_{danger-}$; if $\gamma_i \in (0, 1)$, the minority class sample and the majority class samples in its neighborhood are added to the boundary region $D_{border}$; if $\gamma_i = 1$, the minority class sample is added to the minority class safe region $D_{safe-}$. The remaining samples of the training set $D$ are added to the majority class safe region $D_{safe+}$.

The boundary region contains $m$ minority class samples $x_{border_i}$, $i = 1, 2, ..., m$, and $n$ majority class samples, and the number of samples to be synthesized in the boundary region is $G = (m + n) \times b - m$, $b \in [0.5, 1]$, where $b$ is the synthesis scale factor. For each $x_{border_i}$, the proportion of majority class samples among its $k$ neighbors is $r_i = N_{i+} / k$, where $N_{i+}$ is the number of majority class samples in its neighborhood. A weight is generated from the ratio of each minority class sample's majority proportion to the sum of these proportions:

$$w_i = \frac{r_i}{\sum_{j=1}^{m} r_j}$$

The synthesis count for each minority class sample of the boundary region is $g_i = w_i \times G$, $i = 1, 2, ..., m$. By the SMOTE method, $g_i$ minority class samples are synthesized around each minority class sample of the boundary region, and the synthesized samples are added to $D_{border}$ to obtain the boundary-region oversampled set $D_{border}^{over}$. Algorithm 1 is the pseudocode of the region partition algorithm of step 101.

The majority class safe region $D_{safe+}$ is clustered into a number of clusters determined by $N_{safe+}$, where $N_{safe+}$ is the number of samples in $D_{safe+}$. Each cluster is randomly undersampled to half of its sample count, which yields the majority class safe region undersampled set $D_{safe+}^{under}$. The minority class noise region $D_{danger-}$ is removed and the minority class safe region samples $D_{safe-}$ are retained, finally giving the balanced data set $D_{balance} = D_{safe-} + D_{safe+}^{under} + D_{border}^{over}$.
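The region partition of step 101 can be sketched in plain Python as follows. This is a minimal illustration rather than the Algorithm 1 pseudocode referred to above: the function name `partition_regions` is hypothetical, the neighbor search is brute force, the class ratio is taken to be the fraction of minority class samples among the $k$ neighbors, and majority class neighbors are moved to the boundary region only for minority samples whose ratio lies strictly between 0 and 1. Each of these choices is an assumption.

```python
import math


def partition_regions(samples, labels, k=13, minority=0):
    """Split sample indices into noise, minority-safe, border, majority-safe."""
    noise, safe, border = set(), set(), set()
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y != minority:
            continue
        # Brute-force k nearest neighbors of x, excluding x itself.
        others = [j for j in range(len(samples)) if j != i]
        others.sort(key=lambda j: math.dist(x, samples[j]))
        neighbors = others[:k]
        if not neighbors:
            continue
        n_minority = sum(1 for j in neighbors if labels[j] == minority)
        if n_minority == 0:                     # ratio 0: minority noise
            noise.add(i)
        elif n_minority == len(neighbors):      # ratio 1: minority safe
            safe.add(i)
        else:                                   # ratio in (0, 1): border
            border.add(i)
            border.update(j for j in neighbors if labels[j] != minority)
    # Remaining majority samples form the majority class safe region.
    majority_safe = {j for j, y in enumerate(labels) if y != minority} - border
    return noise, safe, border, majority_safe


# Toy 1-D data: label 0 is the minority (fault) class, label 1 the majority.
points = [(0.0,), (0.1,), (0.2,), (10.0,), (10.2,), (10.1,), (20.0,), (20.1,), (20.3,)]
labels = [0, 0, 0, 1, 1, 0, 0, 0, 1]
print(partition_regions(points, labels, k=2))
```

In the toy data, the isolated minority sample at 10.1 lands in the noise region, the tight minority cluster near 0 is safe, and the mixed group near 20 forms the boundary region. For real SMART data a tree-based or library neighbor search would normally replace the brute-force sort.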

Step 102, the original data set and the balanced data set of the disk are input into an RF algorithm for machine learning, an original model biased toward the majority class and a local region strengthening and weakening model are trained respectively, and the two models are integrated to obtain a hybrid model biased toward the peripheral boundary.

Specifically, the original model biased toward the majority class, the local region strengthening and weakening model, and the hybrid model biased toward the peripheral boundary are trained with the random forest (RF) algorithm. The main parameters of the algorithm are: number of decision trees n_estimators = 1000, decision tree splitting criterion criterion = 'gini', minimum number of samples required to split an internal node min_samples_split = 2, and minimum number of samples required at a leaf node min_samples_leaf = 1. The original model biased toward the majority class, $RF_1$, is obtained by training on the unbalanced training set $D$, with forest size $s = 100$. The local region strengthening and weakening model, $RF_2$, is obtained by training on the balanced data set $D_{balance}$, with forest size $s = 100$. All base classifiers of $RF_1$ and $RF_2$ are integrated to obtain the hybrid model $RF$ biased toward the peripheral boundary, with forest size $s = 200$ and model ratio $q = 0.5$.
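The integration of the two forests can be pictured as pooling their base classifiers. The sketch below uses constant stand-in functions in place of fitted decision trees, so it shows only the bookkeeping; `combine_forests` and `model_ratio` are hypothetical names, and reading the model ratio $q$ as the share of hybrid trees taken from $RF_1$ is an assumption.

```python
from typing import Callable, List, Sequence

# Stand-in type for a fitted decision tree: maps a feature vector to 0 or 1.
Tree = Callable[[Sequence[float]], int]


def combine_forests(rf1: List[Tree], rf2: List[Tree]) -> List[Tree]:
    """Pool every base classifier of both forests into one hybrid forest."""
    return list(rf1) + list(rf2)


def model_ratio(rf1: List[Tree], hybrid: List[Tree]) -> float:
    """Share q of hybrid trees that come from RF1 (assumed interpretation)."""
    return len(rf1) / len(hybrid)


# Constant predictors used only to demonstrate the sizes s = 100, 100, 200.
rf1 = [lambda x: 1 for _ in range(100)]   # biased toward the majority class
rf2 = [lambda x: 0 for _ in range(100)]   # trained on the balanced set
hybrid = combine_forests(rf1, rf2)
print(len(hybrid), model_ratio(rf1, hybrid))  # 200 0.5
```

With two forests of 100 trees each, pooling gives a hybrid of 200 trees with half drawn from each model, which matches the stated $s = 200$ and $q = 0.5$.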

Step 103, the three models are adaptively selected among according to the imbalance degree of the nearest neighbors in the original disk data set, and the resulting classification probability is used to predict the disk failure state.

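Specifically, given a disk test data set $T$, the nearest neighbors of each test point in the original disk data are found by the kNN algorithm, the number $T_{i+}$ of majority class samples among the neighbors is counted, and the neighbor imbalance degree of the test point is calculated. If the imbalance degree is 1, the test point is classed as surrounded entirely by majority class samples, and the original model biased toward the majority class, $RF_1$, is selected for prediction. If the test point is classed as surrounded mostly by majority class samples, the hybrid model $RF$ biased toward the peripheral boundary is selected for prediction. If the test point is classed as surrounded by few majority class samples, the local region strengthening and weakening model $RF_2$ is selected for prediction. Finally, the decision tree results of all models are integrated, and the final classification result Label is obtained by hard voting:

$$Label(x) = \arg\max_{y} \sum_{t=1}^{s} I(h_t(x) = y)$$

where $I(\cdot)$ is the indicator function, $\arg\max_{y}$ gives the prediction class for which the vote count of the test sample is largest, $h_t(x)$ is the result of the $t$-th decision tree, $x$ is the test point, and $y$ takes one of two classes, the minority class 0 and the majority class 1. A result of 1 indicates that the disk is predicted to be normal with high probability, and 0 indicates that the disk is predicted to fail with high probability, so that the predicted disk state of the test sample is obtained.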
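Hard voting over decision tree outputs can be sketched as below. The trees are again stand-in callables returning 0 or 1, the function name `hard_vote` is hypothetical, and breaking ties toward the smaller label (the fault class 0) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def hard_vote(trees, x):
    """Return the class predicted by the most trees for test point x."""
    votes = Counter(tree(x) for tree in trees)
    best = max(votes.values())
    # Tie-break toward the smaller label; this rule is an assumption.
    return min(label for label, count in votes.items() if count == best)


# Two trees vote "normal" (1) and one votes "fault" (0).
trees = [lambda x: 1, lambda x: 1, lambda x: 0]
print(hard_vote(trees, [0.3, 0.7]))  # 1
```

Counting votes per class and taking the most frequent one corresponds to summing the indicator function over the trees and taking the maximizing class.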

Algorithm 2 is the pseudo code dynamically selected for the model of step 103:
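The following Python sketch is not Algorithm 2 itself but a minimal illustration of the dynamic selection idea in step 103. It assumes that the imbalance degree is the fraction of majority class samples among the $k$ neighbors of the test point, and that a hypothetical threshold of 0.5 separates "mostly majority" from "few majority" neighborhoods; neither the formula nor the threshold is given in the text, and `select_model` is an illustrative name.

```python
def select_model(n_majority_neighbors, k, threshold=0.5):
    """Pick which model predicts a test point from its neighborhood makeup.

    n_majority_neighbors: number of majority class samples among the k
        nearest neighbors of the test point in the original data set.
    threshold: assumed cut-off between the hybrid model and RF2.
    """
    if n_majority_neighbors == k:
        return "RF1"      # all neighbors are majority class
    if n_majority_neighbors / k >= threshold:
        return "hybrid"   # mostly majority class neighbors
    return "RF2"          # few majority class neighbors


for count in (13, 10, 3):
    print(count, select_model(count, k=13))
```

Points deep inside the majority region go to the model trained on the original distribution, points near the class boundary go to the pooled forest, and points in minority-dominated neighborhoods go to the model trained on the balanced set.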

algorithm 3 is a pseudo code of a disk failure prediction method based on unbalanced integration two classes:

Table 1 lists the public data sets to which the disk failure prediction method based on unbalanced integrated binary classification provided by the embodiment of the invention is applied, and describes each data set in detail: the number of features, the data distribution (number of majority class samples and number of minority class samples), and the imbalance ratio (ratio of the number of majority class samples to the number of minority class samples). Table 2 lists the SMART attributes selected from the disk data.

Table 1

Table 2

Table 3 gives the comparative experimental results, measured by F-measure (the harmonic mean of minority class recall and precision), when the disk failure prediction method based on unbalanced integrated binary classification is applied to the classification of 10 public data sets and to disk failure prediction. The comparison methods in the embodiment of the invention are six methods typically used for the unbalanced binary classification problem: RUSBoost, SMOTEBoost, EasyEnsemble, BalancedBagging, BRAF and DTE-SBD. Table 3 shows that the proposed method, DPHS-MDS, achieves a clearly higher F-measure than the comparison methods on both the public data sets and the disk data set. In particular, the average result over the 10 public data sets and the disk data set is clearly improved, which indicates a clear improvement in disk failure prediction performance. The method provided by the embodiment of the invention therefore represents an advance in disk failure prediction.

Table 3

In summary, the embodiments of the present invention have the following beneficial effects:

In the technical scheme of the embodiments of the invention, SMART data of a disk is sampled, state features related to disk failures are selected as an original data set, and a balanced data set is obtained through data-partition hybrid sampling; the original data set and the balanced data set of the disk are input into an RF algorithm for machine learning, an original model biased toward the majority class and a local region strengthening and weakening model are trained respectively, and the two models are integrated to obtain a hybrid model biased toward the peripheral boundary; and the three models are adaptively selected among according to the imbalance degree of the nearest neighbors in the original disk data set, and the resulting classification probability is used to predict the disk failure state. The technical scheme provided by the embodiments of the invention can effectively address the difficulty of predicting disk failures when the numbers of normal and abnormal samples are unbalanced, generates different classification models by adjusting the data distribution so as to improve unbalanced classification performance, and improves machine-learning-based disk failure prediction capability.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A disk failure prediction method based on unbalanced integrated binary classification is characterized by comprising the following steps:

(1) sampling SMART data of a disk, selecting state features related to disk failures as an original data set, and obtaining a balanced data set through data-partition hybrid sampling;

(2) inputting the original data set and the balanced data set of the disk into an RF algorithm for machine learning, training an original model biased toward the majority class and a local region strengthening and weakening model respectively, and integrating the two models to obtain a hybrid model biased toward the peripheral boundary;

(3) adaptively selecting among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, and using the resulting classification probability to predict the disk failure state.

2. The method according to claim 1, wherein sampling SMART data of a disk, selecting state features related to disk failures as an original data set, and obtaining a balanced data set through data-partition hybrid sampling is specifically as follows: jump analysis and value-range analysis are performed on the disk data, features that are related to disk failures and have larger entropy values are selected, and the feature values are collected as the original data set $D$; the data labeled normal and faulty in $D$ are first divided into a majority class set $D_{maj}$ and a minority class set $D_{min}$; a boundary region $D_{border}$, a minority class noise region $D_{danger-}$, a minority class safe region $D_{safe-}$ and a majority class safe region $D_{safe+}$ are defined and initialized as empty sets, where + denotes majority class samples and - denotes minority class samples; the minority class set $D_{min}$, which contains the minority class samples $x_i$, $i = 1, 2, ..., N_{min}$, where $N_{min}$ is the number of samples in the minority class set, is then traversed; the $k$ nearest neighbors of each minority class sample are found by the kNN algorithm, with $k = 13$, and the number $N_{i-}$ of minority class samples among the neighbors is counted; the majority class samples among the neighbors are stored in the boundary region $D_{border}$, and the class ratio in the neighborhood of each minority class sample is calculated

3. The method of claim 1, wherein inputting the original data set and the balanced data set of the disk into the RF algorithm for machine learning, training an original model biased toward the majority class and a local region strengthening and weakening model respectively, and integrating the two models to obtain a hybrid model biased toward the peripheral boundary is specifically as follows: the original model biased toward the majority class, the local region strengthening and weakening model, and the hybrid model biased toward the peripheral boundary are trained with the random forest (RF) algorithm, whose main parameters are: number of decision trees n_estimators = 1000, decision tree splitting criterion criterion = 'gini', minimum number of samples required to split an internal node min_samples_split = 2, and minimum number of samples required at a leaf node min_samples_leaf = 1; the original model biased toward the majority class, $RF_1$, is obtained by training on the unbalanced training set $D$, with forest size $s = 100$; the local region strengthening and weakening model, $RF_2$, is obtained by training on the balanced data set $D_{balance}$, with forest size $s = 100$; and all base classifiers of $RF_1$ and $RF_2$ are integrated to obtain the hybrid model $RF$ biased toward the peripheral boundary, with forest size $s = 200$ and model ratio $q = 0.5$.

4. The method of claim 1, wherein adaptively selecting among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, and using the resulting classification probability to predict the disk failure state is specifically as follows: given a test set $T$, the nearest neighbors of each test point in the original disk data are found by the kNN algorithm, the number $T_{i+}$ of majority class samples among the neighbors is counted, and the neighbor imbalance degree of the test point is calculated

wherein $I(\cdot)$ is the indicator function,