CN112465153A - Disk fault prediction method based on unbalanced integrated binary classification - Google Patents

Publication number
CN112465153A
Authority
CN
China
Prior art keywords
samples
disk
minority
class
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011510541.0A
Other languages
Chinese (zh)
Inventor
高欣
任昺
何杨
李康生
井潇
纪维佳
查森
王�锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Publication of CN112465153A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a disk failure prediction method based on unbalanced integrated binary classification, which comprises the following steps: sampling SMART data of a disk, selecting state features related to disk failure as an original data set, and obtaining a balanced data set through data partition mixed sampling; inputting the original data set and the balanced data set of the disk into an RF algorithm for machine learning, training respectively an original model biased toward the majority class and a local-region strengthening-and-weakening model, and integrating the two models to obtain a hybrid model biased toward the surrounding boundary; and adaptively selecting among the three models according to the degree of imbalance of each test point's nearest neighbors in the original disk data set, the resulting classification probability being used to predict the disk failure state. The method effectively addresses the difficulty of predicting disk failures when the numbers of normal and abnormal samples are unbalanced, and improves machine-learning-based disk failure prediction.

Description

Disk fault prediction method based on unbalanced integrated binary classification
[ technical field ]
The invention relates to the technical field of information storage, and in particular to a disk failure prediction method based on unbalanced integrated binary classification.
[ background of the invention ]
With the continuous development of the information industry, large volumes of paper records have been digitized and new electronic data is generated continuously, so data storage services have grown rapidly. Storage systems deploy disks at very large scale, and disk stability bears directly on the safety and reliability of the whole storage system in a data center. The disk is the hardware component with the highest failure rate; once a disk runs abnormally or loses data, services may be unrecoverable and the impact is severe. If disk failures can be predicted in advance, operation and maintenance personnel can back up data and replace disks ahead of time, which greatly reduces risk and loss. At present, disk manufacturers all use SMART (Self-Monitoring, Analysis and Reporting Technology) to monitor disks, but the traditional threshold-based judgment has too low a failure detection rate and gives poor early warning in practice. Disk failure prediction based on machine learning obtains a satisfactory prediction effect through the strong learning capability of the model. Such methods mainly use unbalanced classification: a large amount of SMART data from healthy and failed disks is collected, features are extracted from the data, and a classification model is trained. Many approaches have been proposed for the unbalanced classification problem; they fall mainly into data-level approaches, algorithm-level approaches, and approaches that combine data processing with algorithms.
Approaches that combine data processing with algorithms perform better on the unbalanced classification problem, but they do not fully consider the data distribution of the sample space, cannot improve classification performance by applying different classifiers in different regions, and select the model with a simple static strategy instead of predicting each test object individually, which reduces the applicability of the model.
[ summary of the invention ]
In view of this, an embodiment of the present invention provides a disk failure prediction method based on unbalanced integrated binary classification. It addresses the difficulty of predicting disk failures when the numbers of normal and abnormal samples are unbalanced, and it improves unbalanced classification performance by adjusting the data distribution to generate different classification models, thereby improving machine-learning-based disk failure prediction.
The embodiment of the invention provides a disk failure prediction method based on unbalanced integrated binary classification, which comprises the following steps:
sampling SMART data of a disk, selecting state features related to disk failure as an original data set, and obtaining a balanced data set through data partition mixed sampling;
inputting the original data set and the balanced data set of the disk into an RF algorithm for machine learning, training respectively an original model biased toward the majority class and a local-region strengthening-and-weakening model, and integrating the two models to obtain a hybrid model biased toward the surrounding boundary;
and adaptively selecting among the three models according to the degree of imbalance of each test point's nearest neighbors in the original disk data set, the resulting classification probability being used to predict the disk failure state.
In the method, sampling SMART data of the disk, selecting state features related to disk failure as the original data set, and obtaining a balanced data set through data partition mixed sampling is carried out as follows. Jump analysis and value-range analysis are performed on the disk data, the features that are related to disk failure and have larger entropy are selected, and the feature values are collected as the original data set $D$. The data labeled normal and failed in $D$ are first divided into a majority class set $D_{maj}$ and a minority class set $D_{min}$. Four regions are defined and initialized as empty sets: the boundary region $D_{border}$, the minority class noise region $D_{danger-}$, the minority class safe region $D_{safe-}$ and the majority class safe region $D_{safe+}$, where + denotes majority class samples and - denotes minority class samples.
The minority class set $D_{min}$ is then traversed. It contains the minority class samples $x_i$, $i = 1, 2, \ldots, N_{min}$, where $N_{min}$ is the number of samples in the set. The $k$ nearest neighbors of each minority class sample are found with the kNN algorithm, with $k = 13$; the number $N_{i-}$ of minority class samples among the neighbors is counted, $i = 1, 2, \ldots, N_{min}$; the majority class samples among the neighbors are stored in the boundary region $D_{border}$; and the class proportion in the neighborhood of the minority class sample is calculated as
$$\gamma = \frac{N_{i-}}{k}$$
If $\gamma = 0$, the minority class sample is added to the minority class noise region $D_{danger-}$; if $\gamma \in (0, 1)$, the minority class sample and the majority class samples in its neighborhood are added to the boundary region $D_{border}$; if $\gamma = 1$, the minority class sample is added to the minority class safe region $D_{safe-}$. The remaining samples of the training set $D$ are added to the majority class safe region $D_{safe+}$, and removing the minority class noise region $D_{danger-}$ from $D$ gives the filtered set $D_{filter}$.
The number of samples $N_{border}$ in the boundary region $D_{border}$ is counted; it comprises $m$ minority class samples and $n$ majority class samples. The minority class samples in the boundary region are denoted $x_{border\_i}$, $i = 1, 2, \ldots, m$, and for each $x_{border\_i}$ the number $N_{i+}$ of majority class samples in its neighborhood is counted. The number of samples to be synthesized for the boundary region is
$$G = (m + n) \times b - m, \quad b \in [0.5, 1]$$
where $b$ is the synthesis scale factor. When $b = 1$, $G = n$: the number of synthesized minority class samples balances the number of majority class samples, and the minority class after synthesis numbers $m + n$, the original total number of samples in the boundary region. For each boundary-region minority class sample $x_{border\_i}$, the proportion of majority class samples among its $k$ nearest neighbors is calculated and recorded as
$$r_i = \frac{N_{i+}}{k}$$
A weight is generated from the ratio of each minority class neighborhood's majority class proportion to the sum of these proportions:
$$w_i = \frac{r_i}{\sum_{j=1}^{m} r_j}$$
The number of samples to synthesize for each minority class sample of the boundary region is $g_i = w_i \times G$, $i = 1, 2, \ldots, m$. Around each minority class sample in the boundary region, $g_i$ minority class samples are synthesized by the SMOTE method and added to the boundary region $D_{border}$, giving the boundary region oversampled set $D_{border}^{over}$.
The majority class safe region $D_{safe+}$ is clustered into the number of clusters given by
Figure BDA0002846267110000033
where $N_{safe+}$ is the number of samples in the majority class safe region $D_{safe+}$. Each cluster is randomly undersampled to half of the number of samples in the cluster, giving the majority class safe region undersampled set $D_{safe+}^{under}$. The minority class noise region $D_{danger-}$ is discarded and the minority class safe region samples $D_{safe-}$ are kept, which finally gives the balanced data set $D_{balance} = D_{safe-} + D_{safe+}^{under} + D_{border}^{over}$.
In the method, inputting the original data set and the balanced data set of the disk into the RF algorithm for machine learning, training respectively an original model biased toward the majority class and a local-region strengthening-and-weakening model, and integrating the two models to obtain a hybrid model biased toward the surrounding boundary is carried out as follows. The three models are trained with the random forest (RF) algorithm. The main parameters of the algorithm are: number of decision trees n_estimators = 1000; decision tree splitting criterion criterion = 'gini'; minimum number of samples required to split an internal node min_samples_split = 2; minimum number of samples required at a leaf node min_samples_leaf = 1. The original model $RF_1$ biased toward the majority class is trained on the unbalanced training set $D$, with forest size $s = 100$. The local-region strengthening-and-weakening model $RF_2$ is trained on the balanced data set $D_{balance}$, with forest size $s = 100$. All base classifiers of $RF_1$ and $RF_2$ are integrated to obtain the hybrid model $RF$ biased toward the surrounding boundary, with forest size $s = 200$ and model ratio $q = 0.5$.
In the method, adaptively selecting among the three models according to the degree of imbalance of the nearest neighbors in the original disk data set, and using the resulting classification probability to predict the disk failure state, is carried out as follows. Given a test set $T$, the nearest neighbors of each test point in the original disk data are found with the kNN algorithm, the number $T_{i+}$ of majority class samples among the neighbors is counted, and the neighbor imbalance degree of the test point is calculated:
Figure BDA0002846267110000041
If the imbalance degree is 1, the test point is classified as surrounded entirely by majority class samples, and the original model $RF_1$ biased toward the majority class is selected to predict test points of this type. If
Figure BDA0002846267110000042
the test point is classified as surrounded by many majority class samples, and the hybrid model $RF$ biased toward the surrounding boundary is selected to predict test points of this type. If
Figure BDA0002846267110000043
the test point is classified as surrounded by few majority class samples, and the local-region strengthening-and-weakening model $RF_2$ is selected to predict test points of this type. Finally, the decision tree results of the model are integrated and the final classification result Label is obtained by hard voting:
$$\mathrm{Label}(x) = \arg\max_{y} \sum_{t=1}^{s} I\left(h_t(x) = y\right)$$
where $I(\cdot)$ is the indicator function, $\arg\max_{y}$ gives the predicted class $y$ of the test sample for which the vote count is largest, $h_t(x)$ is the result of the $t$-th decision tree, $x$ is the test point, and $y$ ranges over the two classes, minority class 0 and majority class 1. A result of 1 indicates that the disk is predicted as more likely normal and 0 that it is predicted as more likely to fail, which gives the predicted disk state of the test sample.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
FIG. 1 is a flow chart of a disk failure prediction method based on unbalanced integration two-classification according to an embodiment of the present invention;
FIG. 2 is a framework flow chart of the disk failure prediction method based on unbalanced integrated binary classification according to an embodiment of the present invention.
[ detailed description of the embodiments ]
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a disk failure prediction method based on unbalanced integrated binary classification. FIG. 1 is a schematic flow chart of the method; as shown in FIG. 1, the method includes the following steps:
Step 101: SMART data of a disk is sampled, state features related to disk failure are selected as an original data set, and a balanced data set is obtained through data partition mixed sampling.
Specifically, the SMART data set used comes from Backblaze. SMART is a disk self-monitoring, analysis and reporting technology, a set of standards established by disk manufacturers; all the data it monitors and records is called SMART data. Jump analysis and value-range analysis are performed on the SMART data of the disk, and the features that are related to disk failure and have larger entropy are selected. The selected SMART attributes are: raw read error rate, start/stop count, reallocated sector count, seek error rate, power-on hours, uncorrectable errors, command timeout, head load/unload count, temperature, current pending sector count, offline uncorrectable sector count, head flying hours/transfer error rate, total LBAs written, and total LBAs read. After the disk data is collected, the labels of good and bad disks are obtained from the monitoring system, and the labels together with the values of the 14 SMART attributes form the original data set $D$.
The data labeled normal and failed in $D$ are first divided into a majority class set $D_{maj}$ and a minority class set $D_{min}$. Four regions are defined and initialized as empty sets: the boundary region $D_{border}$, the minority class noise region $D_{danger-}$, the minority class safe region $D_{safe-}$ and the majority class safe region $D_{safe+}$, where + denotes majority class samples and - denotes minority class samples. The minority class set $D_{min}$ is then traversed. It contains the minority class samples $x_i$, $i = 1, 2, \ldots, N_{min}$, where $N_{min}$ is the number of samples in the set. The $k$ nearest neighbors of each minority class sample are found with the kNN algorithm, with $k = 13$; the number $N_{i-}$ of minority class samples among the neighbors is counted, $i = 1, 2, \ldots, N_{min}$; the majority class samples among the neighbors are stored in the boundary region $D_{border}$; and the class proportion in the neighborhood of the minority class sample is calculated as
$$\gamma = \frac{N_{i-}}{k}$$
If $\gamma = 0$, the minority class sample is added to the minority class noise region $D_{danger-}$; if $\gamma \in (0, 1)$, the minority class sample and the majority class samples in its neighborhood are added to the boundary region $D_{border}$; if $\gamma = 1$, the minority class sample is added to the minority class safe region $D_{safe-}$. The remaining samples of the training set $D$ are added to the majority class safe region $D_{safe+}$, and removing the minority class noise region $D_{danger-}$ from $D$ gives the filtered set $D_{filter}$.
The number of samples $N_{border}$ in the boundary region $D_{border}$ is counted; it comprises $m$ minority class samples and $n$ majority class samples. The minority class samples in the boundary region are denoted $x_{border\_i}$, $i = 1, 2, \ldots, m$, and for each $x_{border\_i}$ the number $N_{i+}$ of majority class samples in its neighborhood is counted. The number of samples to be synthesized for the boundary region is
$$G = (m + n) \times b - m, \quad b \in [0.5, 1]$$
where $b$ is the synthesis scale factor. When $b = 1$, $G = n$: the number of synthesized minority class samples balances the number of majority class samples, and the minority class after synthesis numbers $m + n$, the original total number of samples in the boundary region. For each boundary-region minority class sample $x_{border\_i}$, the proportion of majority class samples among its $k$ nearest neighbors is calculated and recorded as
$$r_i = \frac{N_{i+}}{k}$$
A weight is generated from the ratio of each minority class neighborhood's majority class proportion to the sum of these proportions:
$$w_i = \frac{r_i}{\sum_{j=1}^{m} r_j}$$
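The region partition described above can be illustrated with a minimal pure-Python sketch. It is not the patent's Algorithm 1: the function names `knn_indices` and `partition_regions` are hypothetical, the neighbor search is brute-force Euclidean, and it assumes that majority class neighbors enter the boundary region only when the proportion of minority class neighbors lies strictly between 0 and 1.

```python
import math

def knn_indices(points, idx, k):
    """Indices of the k nearest neighbors of points[idx] (brute force, Euclidean)."""
    dists = [(math.dist(points[idx], p), j) for j, p in enumerate(points) if j != idx]
    dists.sort()
    return [j for _, j in dists[:k]]

def partition_regions(X, y, k=13, minority=0):
    """Return index sets (noise, border, safe_minority, safe_majority)."""
    noise, border, safe_min = set(), set(), set()
    for i, label in enumerate(y):
        if label != minority:
            continue
        neigh = knn_indices(X, i, k)
        n_min = sum(1 for j in neigh if y[j] == minority)
        gamma = n_min / len(neigh)  # proportion of minority class neighbors
        if gamma == 0:
            noise.add(i)            # isolated minority sample: treated as noise
        elif gamma == 1:
            safe_min.add(i)         # surrounded only by minority samples
        else:
            border.add(i)           # mixed neighborhood: boundary region
            border.update(j for j in neigh if y[j] != minority)
    # Assumption: every majority sample not placed in the border is "safe".
    safe_maj = {i for i, label in enumerate(y)
                if label != minority and i not in border}
    return noise, border, safe_min, safe_maj
```

With the 14 SMART features each sample would be a 14-tuple; a spatial index would replace the brute-force search on data of realistic size.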
The number of samples to synthesize for each minority class sample of the boundary region is $g_i = w_i \times G$, $i = 1, 2, \ldots, m$. Around each minority class sample in the boundary region, $g_i$ minority class samples are synthesized by the SMOTE method and added to the boundary region $D_{border}$, giving the boundary region oversampled set $D_{border}^{over}$. Algorithm 1 is the pseudocode of the region partition algorithm of step 101:
Figure BDA0002846267110000065
Figure BDA0002846267110000071
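The weighted allocation of synthetic samples can be sketched as follows, under stated assumptions. Because every proportion is divided by the same $k$, the weight reduces to each sample's majority-neighbor count divided by the sum of those counts. Rounding $g_i$ to an integer is an assumption (the text does not say how fractional counts are handled), and `synthesis_counts` and `smote_point` are hypothetical names. `smote_point` shows only the standard SMOTE interpolation step between a sample and one of its minority class neighbors.

```python
import random

def synthesis_counts(n_maj_neighbors, n_border_majority, b=1.0):
    """Per-sample counts g_i = w_i * G, with G = (m + n) * b - m.

    n_maj_neighbors[i] is the number of majority class neighbors of the
    i-th boundary-region minority sample; n_border_majority is n.
    """
    m = len(n_maj_neighbors)
    total = (m + n_border_majority) * b - m      # G
    denom = sum(n_maj_neighbors)                 # k cancels in w_i
    weights = [c / denom for c in n_maj_neighbors]
    return [round(w * total) for w in weights]   # rounding is an assumption

def smote_point(x, neighbor, rng):
    """One synthetic sample on the segment between x and a minority neighbor."""
    lam = rng.random()
    return tuple(a + lam * (c - a) for a, c in zip(x, neighbor))

counts = synthesis_counts([2, 1, 1], n_border_majority=8, b=1.0)
new_point = smote_point((0.0, 0.0), (2.0, 4.0), random.Random(0))
```

Samples with more majority class neighbors receive more synthetic points, which concentrates the oversampling where the minority class is most contested.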
The majority class safe region $D_{safe+}$ is clustered into the number of clusters given by
Figure BDA0002846267110000081
where $N_{safe+}$ is the number of samples in the majority class safe region $D_{safe+}$. Each cluster is randomly undersampled to half of the number of samples in the cluster, giving the majority class safe region undersampled set $D_{safe+}^{under}$. The minority class noise region $D_{danger-}$ is discarded and the minority class safe region samples $D_{safe-}$ are kept, which finally gives the balanced data set $D_{balance} = D_{safe-} + D_{safe+}^{under} + D_{border}^{over}$.
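The per-cluster undersampling step can be sketched as below. The clustering itself is taken as given, since neither the clustering algorithm nor the cluster count formula is recoverable from the text. The name `undersample_clusters` is hypothetical, and flooring odd cluster sizes is an assumption.

```python
import random

def undersample_clusters(clusters, rng, keep_fraction=0.5):
    """Randomly keep keep_fraction of the members of each cluster.

    clusters is a list of lists of sample identifiers; odd sizes are
    floored (an assumption, the source only says "half").
    """
    kept = []
    for members in clusters:
        n_keep = int(len(members) * keep_fraction)
        kept.extend(rng.sample(members, n_keep))
    return kept

clusters = [[1, 2, 3, 4], [5, 6, 7, 8, 9, 10]]
kept = undersample_clusters(clusters, random.Random(42))
```

Sampling inside each cluster, rather than over the whole safe region at once, keeps every dense area of the majority class represented after the reduction.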
Step 102: the original data set and the balanced data set of the disk are input into an RF algorithm for machine learning, an original model biased toward the majority class and a local-region strengthening-and-weakening model are trained respectively, and the two models are integrated to obtain a hybrid model biased toward the surrounding boundary.
Specifically, the three models are trained with the random forest (RF) algorithm. The main parameters of the algorithm are: number of decision trees n_estimators = 1000; decision tree splitting criterion criterion = 'gini'; minimum number of samples required to split an internal node min_samples_split = 2; minimum number of samples required at a leaf node min_samples_leaf = 1. The original model $RF_1$ biased toward the majority class is trained on the unbalanced training set $D$, with forest size $s = 100$. The local-region strengthening-and-weakening model $RF_2$ is trained on the balanced data set $D_{balance}$, with forest size $s = 100$. All base classifiers of $RF_1$ and $RF_2$ are integrated to obtain the hybrid model $RF$ biased toward the surrounding boundary, with forest size $s = 200$ and model ratio $q = 0.5$.
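How the hybrid forest could be assembled from the two trained forests is sketched below. The reading of the model ratio $q$ as the share of the $s$ hybrid trees taken from $RF_1$ is an assumption; with $s = 200$ and $q = 0.5$ it reduces to pooling all 100 trees of each forest, which matches the text. Trees are represented by arbitrary objects, and `build_hybrid` is a hypothetical name.

```python
def build_hybrid(rf1_trees, rf2_trees, s=200, q=0.5):
    """Pool base classifiers: round(s * q) from RF1, the rest from RF2."""
    n1 = round(s * q)
    n2 = s - n1
    if n1 > len(rf1_trees) or n2 > len(rf2_trees):
        raise ValueError("not enough base classifiers for the requested size")
    return list(rf1_trees[:n1]) + list(rf2_trees[:n2])

rf1 = [f"rf1_tree_{t}" for t in range(100)]  # stand-ins for trained trees
rf2 = [f"rf2_tree_{t}" for t in range(100)]
hybrid = build_hybrid(rf1, rf2)
```

Because the hybrid reuses already trained trees, it adds no training cost beyond the two original forests.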
Step 103: among the three models, one is adaptively selected according to the degree of imbalance of each test point's nearest neighbors in the original disk data set, and the resulting classification probability is used to predict the disk failure state.
Specifically, given a disk test data set $T$, the nearest neighbors of each test point in the original disk data are found with the kNN algorithm, the number $T_{i+}$ of majority class samples among the neighbors is counted, and the neighbor imbalance degree of the test point is calculated:
Figure BDA0002846267110000082
If the imbalance degree is 1, the test point is classified as surrounded entirely by majority class samples, and the original model $RF_1$ biased toward the majority class is selected to predict test points of this type. If
Figure BDA0002846267110000083
the test point is classified as surrounded by many majority class samples, and the hybrid model $RF$ biased toward the surrounding boundary is selected to predict test points of this type. If
Figure BDA0002846267110000084
the test point is classified as surrounded by few majority class samples, and the local-region strengthening-and-weakening model $RF_2$ is selected to predict test points of this type. Finally, the decision tree results of the model are integrated and the final classification result Label is obtained by hard voting:
$$\mathrm{Label}(x) = \arg\max_{y} \sum_{t=1}^{s} I\left(h_t(x) = y\right)$$
where $I(\cdot)$ is the indicator function, $\arg\max_{y}$ gives the predicted class $y$ of the test sample for which the vote count is largest, $h_t(x)$ is the result of the $t$-th decision tree, $x$ is the test point, and $y$ ranges over the two classes, minority class 0 and majority class 1. A result of 1 indicates that the disk is predicted as more likely normal and 0 that it is predicted as more likely to fail, which gives the predicted disk state of the test sample.
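The hard-voting rule can be written in a few lines. In this sketch each decision tree is stood in for by a plain callable that returns 0 or 1; the name `hard_vote` is hypothetical, and breaking ties toward the first listed class (0, failure) is an assumption the text does not address.

```python
from collections import Counter

def hard_vote(trees, x, classes=(0, 1)):
    """Return the class with the most tree votes for test point x."""
    votes = Counter(tree(x) for tree in trees)
    # max() keeps the first class on ties, so ties resolve to classes[0].
    return max(classes, key=lambda y: votes[y])

trees = [lambda x: 1, lambda x: 0, lambda x: 1]  # stand-ins for h_t
label = hard_vote(trees, x=(0.3, 0.7))
```

Here two of the three stand-in trees vote 1, so the sample is labeled normal.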
Algorithm 2 is the pseudocode for the dynamic model selection of step 103:
Figure BDA0002846267110000093
Figure BDA0002846267110000101
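The dynamic selection can be sketched as a function of the majority class share among a test point's neighbors. The two conditions that separate the hybrid model from $RF_2$ are given only as formula images in this text, so the single cut-off `threshold=0.5` used here is a placeholder assumption, not the patent's value; `select_model` is a hypothetical name.

```python
def select_model(n_majority_neighbors, k, threshold=0.5):
    """Pick a model name from the majority class share among k neighbors.

    threshold is a hypothetical cut-off; the source's exact conditions
    are not recoverable.
    """
    ratio = n_majority_neighbors / k
    if ratio == 1:
        return "RF1"   # only majority class neighbors
    if ratio >= threshold:
        return "RF"    # many majority class neighbors: hybrid model
    return "RF2"       # few majority class neighbors

choices = [select_model(t, 13) for t in (13, 9, 2)]
```

Each test point is therefore routed to the forest trained on the data distribution that most resembles its own neighborhood.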
Algorithm 3 is the pseudocode of the disk failure prediction method based on unbalanced integrated binary classification:
Figure BDA0002846267110000102
Figure BDA0002846267110000111
Table 1 lists the public data sets to which the disk failure prediction method based on unbalanced integrated binary classification is applied in the embodiment of the invention. It describes each data set in detail, including the number of features, the data distribution (number of majority class samples and number of minority class samples) and the imbalance ratio (the ratio of the number of majority class samples to the number of minority class samples). Table 2 lists the SMART attributes screened from the disk data.
Table 1
Figure BDA0002846267110000112
Table 2
Figure BDA0002846267110000113
Figure BDA0002846267110000121
Table 3 gives the comparative F-measure results (the harmonic mean of minority class recall and precision) of the disk failure prediction method based on unbalanced integrated binary classification on the classification of 10 public data sets and on disk failure prediction. The comparison methods in the embodiment of the invention are six typical methods for the unbalanced binary classification problem: RUSBoost, SMOTEBoost, EasyEnsemble, BalancedBagging, BRAF and DTE-SBD. Table 3 shows that the proposed method, DPHS-MDS, achieves a clearly higher F-measure than the comparison methods on both the public data sets and the disk data set. In particular, it improves the average result over the 10 public data sets and the disk data set, which indicates a clear improvement in disk failure prediction performance.
Table 3
Figure BDA0002846267110000122
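For reference, the F-measure used in the comparison is the harmonic mean of precision and recall computed on the minority (failure) class. The sketch below treats label 0 as the positive class, following the class coding used in this document; the function name `f_measure` and the sample labels are illustrative only.

```python
def f_measure(y_true, y_pred, positive=0):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive prediction
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score = f_measure([0, 0, 1, 1, 1], [0, 1, 0, 1, 1])
```

In this toy case one of two failures is found and one of two failure predictions is correct, so precision and recall are both 0.5.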
In summary, the embodiments of the present invention have the following beneficial effects:
In the technical solution of the embodiment of the invention, SMART data of a disk is sampled, state features related to disk failure are selected as an original data set, and a balanced data set is obtained through data partition mixed sampling; the original data set and the balanced data set of the disk are input into an RF algorithm for machine learning, an original model biased toward the majority class and a local-region strengthening-and-weakening model are trained respectively, and the two models are integrated to obtain a hybrid model biased toward the surrounding boundary; and among the three models one is adaptively selected according to the degree of imbalance of each test point's nearest neighbors in the original disk data set, the resulting classification probability being used to predict the disk failure state. The technical solution addresses the difficulty of predicting disk failures when the numbers of normal and abnormal samples are unbalanced, generates different classification models by adjusting the data distribution so as to improve unbalanced classification performance, and improves machine-learning-based disk failure prediction.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. A disk failure prediction method based on unbalanced integrated binary classification is characterized by comprising the following steps:
(1) sampling SMART data of a disk, selecting state features related to disk failure as an original data set, and obtaining a balanced data set through data partition mixed sampling;
(2) inputting the original data set and the balanced data set of the disk into an RF algorithm for machine learning, training respectively an original model biased toward the majority class and a local-region strengthening-and-weakening model, and integrating the two models to obtain a hybrid model biased toward the surrounding boundary;
(3) adaptively selecting among the three models according to the degree of imbalance of each test point's nearest neighbors in the original disk data set, the resulting classification probability being used to predict the disk failure state.
2. The method according to claim 1, wherein the SMART data of the disk is sampled, the state features related to the disk failure are selected as an original data set, and a balanced data set is obtained by data partition hybrid sampling, which is specifically described as follows: jump analysis and value domain analysis are carried out on the disk data, the characteristics which are related to disk faults and have larger entropy values are selected, the characteristic values are collected to be used as an original data set D, and firstly, the data marked as normal and fault in the original data set D are divided into a plurality of sets DmajAnd minority class set DminDefining a boundary region DborderMinority noise-like region Ddanger-Minority class security zone Dsafe-A plurality of safety zones Dsafe+And initializing four regions as empty set, + representing majority class samples, -representing minority class samples, and then traversing minority class set DminMinority class set DminIncluding a minority class of samples xi,i=1,2,...,Nmin,NminFor the number of the minority class set samples, k nearest neighbor points of each minority class sample are found through a kNN algorithm, and the number N of the minority class samples in the neighbor points is countedi-,i=1,2,...,NminWhere k is 13, and storing most of the samples in the neighboring points to the boundary region DborderIn (1), calculating majority class ratio in minority class neighborhood
Figure FDA0002846267100000011
If t is 0, adding the minority samples into the minority noise region Ddanger-(ii) a If t is equal to (0,1), adding the minority samples and the majority samples in the neighborhood into the boundary region Dborder(ii) a If gamma is 1, adding the minority class sample into the minority class safe area Dsafe-(ii) a Adding the samples of the residual training set D into a majority class safety zone Dsafe+Training set D eliminating few noise regions Ddanger-Obtaining a filtered set DfilterCounting the boundary region DborderNumber of samples NborderIncluding the number m of minority samples and the number n of majority samples, in the minority samples xiIn the boundary region DborderA few class samples xborder_i1, 2.. m, for each xborder_iNumber N of majority class samples in neighborhoodi+I is 1, 2.. times.m, the number of samples that need to be combined for calculating the boundary area G is (m + n) × b-m, b ∈ [0.5,1 ∈]Wherein b is a synthetic scale factor, when b is 1, the number of the synthesized minority samples and the number of the majority samples are kept balanced, the number of the synthesized minority samples is the original total number of the samples, and for each boundary region, the number of the minority samples x is equal to the number of the majority samples xborder_iCalculating the proportion of samples belonging to most classes in the k adjacent sample points, and recording the proportion as
Figure FDA0002846267100000021
A weight w_i is generated for each minority class sample as the ratio of the majority class proportion in its neighborhood to the sum of the majority class proportions:
Figure FDA0002846267100000022
The number of samples to synthesize for each minority class sample of the boundary region is calculated as g_i = w_i × G, i = 1, 2, ..., m; g_i minority class samples are synthesized around each minority class sample in the boundary region by the SMOTE method, and the synthesized minority class samples are added to the boundary region D_border to obtain the boundary region oversampled set D_{border_over}. The majority class safe region D_{safe+} is clustered into
Figure FDA0002846267100000023
clusters, where N_{safe+} is the number of samples in the majority class safe region D_{safe+}; each cluster is randomly undersampled to half of its number of samples, giving the majority class safe region undersampled set D_{safe+_under}. The minority class noise region D_{danger-} is removed and the minority class safe region samples D_{safe-} are retained, finally giving the balanced data set D_balance = D_{safe-} + D_{safe+_under} + D_{border_over}.
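The region partition and the per-sample synthesis quota of claim 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the formula for t exists only as an image, so the sketch assumes t is the minority share N_{i-}/k of the neighborhood (the reading under which t = 0 marks noise and t = 1 marks a safe sample); majority neighbors are added to the boundary region only in the t ∈ (0, 1) case; quotas g_i are rounded to whole samples; and the function names `k_nearest`, `partition` and `synthesis_counts` are hypothetical. The SMOTE interpolation itself and the clustering-based undersampling of D_{safe+} are not shown.

```python
import math

def k_nearest(i, X, k):
    """Indices of the k nearest neighbors of X[i] (Euclidean), excluding i itself."""
    others = (j for j in range(len(X)) if j != i)
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

def partition(X, y, k=13, minority=0):
    """Assign sample indices to the four regions named in the claim."""
    danger_minus, safe_minus, border = set(), set(), set()
    for i, label in enumerate(y):
        if label != minority:
            continue
        nbrs = k_nearest(i, X, k)
        # ASSUMPTION: t is the minority share N_{i-} / k of the neighborhood.
        t = sum(y[j] == minority for j in nbrs) / k
        if t == 0:
            danger_minus.add(i)      # isolated minority sample: treated as noise
        elif t == 1:
            safe_minus.add(i)        # surrounded only by its own class
        else:
            border.add(i)            # mixed neighborhood
            border.update(j for j in nbrs if y[j] != minority)
    safe_plus = {i for i, label in enumerate(y)
                 if label != minority and i not in border}
    return {"danger-": danger_minus, "safe-": safe_minus,
            "border": border, "safe+": safe_plus}

def synthesis_counts(border, X, y, k=13, b=1.0, minority=0):
    """Quota g_i = w_i * G for each minority sample in the boundary region."""
    mins = sorted(i for i in border if y[i] == minority)
    m, n = len(mins), len(border) - len(mins)
    G = (m + n) * b - m
    # Majority share among the k neighbors of each boundary minority sample.
    ratios = [sum(y[j] != minority for j in k_nearest(i, X, k)) / k for i in mins]
    total = sum(ratios)
    # ASSUMPTION: quotas are rounded to whole samples.
    return {i: round(r / total * G) for i, r in zip(mins, ratios)}

# Toy 1-D data: label 1 = majority (normal), label 0 = minority (fault).
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,),
     (2.4,), (5.5,), (6.0,), (10.0,), (10.5,), (11.0,), (11.5,)]
y = [1] * 6 + [0] * 7
regions = partition(X, y, k=3)
quotas = synthesis_counts(regions["border"], X, y, k=3, b=1.0)
```

On this toy data the minority sample at 2.4 has only majority neighbors and lands in the noise region, the samples at 5.5 and 6.0 form the boundary region together with their majority neighbors, and the cluster around 10 to 11.5 is the minority safe region.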
3. The method of claim 1, wherein the original data set and the balanced data set of the disk are input into the RF algorithm for machine learning, an original model biased toward the majority class and a local region reinforcement and weakening model are trained respectively, and the two models are integrated to obtain a hybrid model biased toward the peripheral boundary, specifically: the original model biased toward the majority class, the local region reinforcement and weakening model, and the hybrid model biased toward the peripheral boundary are trained with the random forest (RF) algorithm, whose main parameters are the number of decision trees n_estimators = 1000, the decision tree splitting criterion criterion = 'gini', the minimum number of samples required to split an internal node min_samples_split = 2, and the minimum number of samples required at a leaf node min_samples_leaf = 1; the original model RF_1 biased toward the majority class is obtained by training on the unbalanced training set D, with forest size s = 100; the local region reinforcement and weakening model RF_2 is obtained by training on the balanced data set D_balance, with forest size s = 100; all base classifiers of RF_1 and RF_2 are integrated to obtain the hybrid model RF biased toward the peripheral boundary, with forest size s = 200 and model ratio q = 0.5.
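The pooling step at the end of claim 3 can be illustrated with a short sketch. It assumes that the model ratio q is the share of the hybrid forest taken from RF_1, which matches the stated sizes (100 of 200 trees); the claim does not define q further, so this reading is an assumption. The helper name `pool_forests` and the string stand-ins for trained trees are hypothetical. The parameter names in `TREE_PARAMS` are copied from the claim; they coincide with parameter names of scikit-learn's `RandomForestClassifier`, although the claim does not name a library.

```python
import random

# Tree-level settings listed in the claim (names kept as written there).
TREE_PARAMS = {"criterion": "gini", "min_samples_split": 2, "min_samples_leaf": 1}

def pool_forests(trees_rf1, trees_rf2, s=200, q=0.5, seed=0):
    """Return s base classifiers: round(s * q) drawn from RF_1, the rest from RF_2."""
    n1 = round(s * q)   # ASSUMPTION: q is the share of trees taken from RF_1
    n2 = s - n1
    if n1 > len(trees_rf1) or n2 > len(trees_rf2):
        raise ValueError("not enough trained trees for the requested mix")
    rng = random.Random(seed)
    return rng.sample(trees_rf1, n1) + rng.sample(trees_rf2, n2)

# Stand-ins for the 100 trained trees of each forest.
rf1 = [f"rf1_tree_{i}" for i in range(100)]
rf2 = [f"rf2_tree_{i}" for i in range(100)]
hybrid = pool_forests(rf1, rf2)   # s = 200, q = 0.5: every tree of both forests
```

With s = 200 and q = 0.5 the hybrid forest is simply the union of both forests; smaller values of s would draw a random subset from each.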
4. The method of claim 1, wherein one of the three models is adaptively selected according to the degree of imbalance among the neighbors of each test point in the original disk data set, and the resulting classification probabilities are used to predict the disk failure state, specifically as follows: given a test set T, the neighbors of each test point in the original disk data are found by the kNN algorithm, the number T_{i+} of majority class samples among the neighbors is counted, and the degree of imbalance of the test point's neighborhood is calculated as
Figure FDA0002846267100000031
If the degree of imbalance equals 1, the test point is classified as a type surrounded entirely by majority class samples, and the original model RF_1 biased toward the majority class is selected to predict test points of this type; if
Figure FDA0002846267100000032
the test point is classified as a type surrounded mostly by majority class samples, and the hybrid model RF biased toward the peripheral boundary is selected to predict test points of this type; if
Figure FDA0002846267100000033
the test point is classified as a type surrounded by only a few majority class samples, and the local region reinforcement and weakening model RF_2 is selected to predict test points of this type. Finally, the decision tree results of the models are integrated, and the final classification result Lable is obtained by hard voting:
Figure FDA0002846267100000034
where I(·) is the indicator function,
Figure FDA0002846267100000035
represents the predicted category of the test sample for which Lable(x) takes its maximum value, h_t(x) represents the result of the t-th decision tree, x represents a test point, and y represents the two categories, minority class 0 and majority class 1, where 1 indicates that the disk is predicted to be normal with high probability and 0 indicates that the disk is predicted to fail with high probability, thereby obtaining the predicted disk state of the test sample.
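The model selection and hard voting of claim 4 can be sketched as follows. The threshold conditions for the second and third cases exist only as images, so the cut at 0.5 below is an assumption chosen to match the wording "mostly" versus "only a few" majority class neighbors; the degree of imbalance is likewise assumed to be T_{i+}/k. The names `select_model` and `hard_vote`, the tie-breaking rule, and the lambda stand-ins for decision trees are hypothetical.

```python
from collections import Counter

def select_model(majority_neighbors, k, cut=0.5):
    """Pick a model name from the majority share T_{i+} / k of a test point's neighbors."""
    ratio = majority_neighbors / k
    if ratio == 1:
        return "RF1"                         # all neighbors are majority class
    # ASSUMPTION: the image-only thresholds split the remaining range at `cut`.
    return "RF" if ratio >= cut else "RF2"

def hard_vote(trees, x):
    """Return the label y maximizing the number of trees with h_t(x) == y."""
    votes = Counter(h(x) for h in trees)
    # ASSUMPTION: ties go to the smaller label (0, the fault class).
    return max(sorted(votes), key=votes.get)

# Stand-in trees: each maps a test point to 0 (fault) or 1 (normal).
trees = [lambda x: 1, lambda x: 0, lambda x: 1]
label = hard_vote(trees, x=None)
choices = [select_model(13, 13), select_model(7, 13), select_model(3, 13)]
```

Here two of three trees vote 1, so the hard vote predicts a normal disk, and the three neighbor counts select RF_1, the hybrid RF, and RF_2 respectively.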
CN202011510541.0A 2019-12-23 2020-12-18 Disk fault prediction method based on unbalanced integrated binary classification Pending CN112465153A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911339988.3A CN111091201A (en) 2019-12-23 2019-12-23 Data partition mixed sampling-based unbalanced integrated classification method
CN2019113399883 2019-12-23

Publications (1)

Publication Number Publication Date
CN112465153A true CN112465153A (en) 2021-03-09

Family

ID=70395790

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201911339988.3A Pending CN111091201A (en) 2019-12-23 2019-12-23 Data partition mixed sampling-based unbalanced integrated classification method
CN202011510541.0A Pending CN112465153A (en) 2019-12-23 2020-12-18 Disk fault prediction method based on unbalanced integrated binary classification

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201911339988.3A Pending CN111091201A (en) 2019-12-23 2019-12-23 Data partition mixed sampling-based unbalanced integrated classification method

Country Status (1)

Country Link
CN (2) CN111091201A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434401A (en) * 2021-06-24 2021-09-24 杭州电子科技大学 Software defect prediction method based on sample distribution characteristics and SPY algorithm
CN113591896A (en) * 2021-05-18 2021-11-02 广西电网有限责任公司电力科学研究院 Power grid attack event classification detection method

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091201A (en) * 2019-12-23 2020-05-01 北京邮电大学 Data partition mixed sampling-based unbalanced integrated classification method
CN112364706A (en) * 2020-10-19 2021-02-12 燕山大学 Small sample bearing fault diagnosis method based on class imbalance
CN112365060B (en) * 2020-11-13 2024-01-26 广东电力信息科技有限公司 Preprocessing method for network Internet of things sensing data
CN112508243B (en) * 2020-11-25 2022-09-09 国网浙江省电力有限公司信息通信分公司 Training method and device for multi-fault prediction network model of power information system
CN112800917B (en) * 2021-01-21 2022-07-19 华北电力大学(保定) Circuit breaker unbalance monitoring data set oversampling method
CN112836735B (en) * 2021-01-27 2023-09-01 中山大学 Method for processing unbalanced data set by optimized random forest
CN112633426B (en) * 2021-03-11 2021-06-15 腾讯科技(深圳)有限公司 Method and device for processing data class imbalance, electronic equipment and storage medium
CN114612255B (en) * 2022-04-08 2023-11-07 湖南提奥医疗科技有限公司 Insurance pricing method based on electronic medical record data feature selection
CN114969669B (en) * 2022-07-27 2022-11-15 深圳前海环融联易信息科技服务有限公司 Data balance degree processing method, joint modeling system, device and medium
CN115374858B (en) * 2022-08-24 2024-05-14 东北大学 Intelligent diagnosis method for flow industrial production quality based on hybrid integrated model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359704A (en) * 2018-12-26 2019-02-19 北京邮电大学 A kind of more classification methods integrated based on adaptive equalization with dynamic layered decision
CN111091201A (en) * 2019-12-23 2020-05-01 北京邮电大学 Data partition mixed sampling-based unbalanced integrated classification method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591896A (en) * 2021-05-18 2021-11-02 广西电网有限责任公司电力科学研究院 Power grid attack event classification detection method
CN113434401A (en) * 2021-06-24 2021-09-24 杭州电子科技大学 Software defect prediction method based on sample distribution characteristics and SPY algorithm
CN113434401B (en) * 2021-06-24 2022-10-28 杭州电子科技大学 Software defect prediction method based on sample distribution characteristics and SPY algorithm

Also Published As

Publication number Publication date
CN111091201A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
CN112465153A (en) Disk fault prediction method based on unbalanced integrated binary classification
CN108986869B (en) Disk fault detection method using multi-model prediction
CN110703057B (en) Power equipment partial discharge diagnosis method based on data enhancement and neural network
CN104503874A (en) Hard disk failure prediction method for cloud computing platform
Chien et al. A system for online detection and classification of wafer bin map defect patterns for manufacturing intelligence
Yu et al. Pareto-optimal adaptive loss residual shrinkage network for imbalanced fault diagnostics of machines
CN106682688A (en) Pile-up noise reduction own coding network bearing fault diagnosis method based on particle swarm optimization
CN107168995B (en) Data processing method and server
CN105760889A (en) Efficient imbalanced data set classification method
CN112951311B (en) Hard disk fault prediction method and system based on variable weight random forest
CN112214369A (en) Hard disk fault prediction model establishing method based on model fusion and application thereof
CN103941131A (en) Transformer fault detecting method based on simplified set unbalanced SVM (support vector machine)
CN111881289B (en) Training method of classification model, and detection method and device of data risk class
CN112365060B (en) Preprocessing method for network Internet of things sensing data
KR102144010B1 (en) Methods and apparatuses for processing data based on representation model for unbalanced data
CN116582300A (en) Network traffic classification method and device based on machine learning
Bhat et al. An empirical evaluation of defect prediction approaches in within-project and cross-project context
CN112699936B (en) Electric power CPS generalized false data injection attack identification method
CN110673997B (en) Disk failure prediction method and device
CN110991241B (en) Abnormality recognition method, apparatus, and computer-readable medium
CN115438239A (en) Abnormity detection method and device for automatic abnormal sample screening
Becker et al. Rough set theory in the classification of loan applications
CN115545111B (en) Network intrusion detection method and system based on clustering self-adaptive mixed sampling
Zhihao et al. Comparison of the different sampling techniques for imbalanced classification problems in machine learning
CN114756420A (en) Fault prediction method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210309
