CN112465153A - Disk fault prediction method based on unbalanced integrated binary classification - Google Patents
- Publication number
- CN112465153A (application CN202011510541.0A)
- Authority
- CN
- China
- Prior art keywords
- samples
- disk
- minority
- class
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a disk failure prediction method based on unbalanced integrated binary classification, which comprises the following steps: sampling SMART data of a disk, selecting the state features related to disk failure as an original data set, and obtaining a balanced data set through data partition hybrid sampling; inputting the original data set and the balanced data set of the disk into an RF algorithm for machine learning, training an original model biased toward the majority class and a local-region strengthening and weakening model respectively, and integrating the two models to obtain a hybrid model biased toward the surrounding boundary; and adaptively selecting among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, the resulting classification probability being used to predict the disk failure state. The method effectively addresses the difficulty of predicting disk failures when the numbers of normal and abnormal samples are unbalanced, and improves machine-learning-based disk failure prediction.
Description
[ technical field ]
The invention relates to the technical field of information storage, and in particular to a disk failure prediction method based on unbalanced integrated binary classification.
[ background of the invention ]
With the continuous development of the information industry, large amounts of paper records have been digitized and electronic data is generated continuously, so data storage services have grown rapidly. Storage systems contain a very large number of disks, and disk stability determines the safety and reliability of the whole storage system in a data center. The disk is the hardware component with the highest failure rate; once a disk operates abnormally or loses data, the service may be unrecoverable and the impact severe. If disk failures can be predicted in advance, operation and maintenance personnel can back up data and replace disks ahead of time, which greatly reduces risk and loss. At present, disk manufacturers all use SMART (Self-Monitoring, Analysis and Reporting Technology) to monitor disks, but the failure detection rate of the traditional threshold-based method is too low and its early-warning effect in practice is poor. Machine-learning-based disk failure prediction achieves a satisfactory prediction effect through the strong learning capability of the model. Such methods mainly use imbalanced classification: a large amount of SMART data from healthy and failed disks is collected, features are extracted, and a classification model is trained. Many approaches have been proposed for the imbalanced classification problem; they fall mainly into data-level approaches, algorithm-level approaches, and approaches that combine data processing with algorithms.
Approaches that combine data processing with algorithms perform better on imbalanced classification, but they do not fully consider the data distribution of the sample space and cannot improve classification performance by using different classifiers in different regions; they select the model with a simple static strategy and do not predict each test object individually, which reduces the applicability of the model.
[ summary of the invention ]
In view of this, the embodiment of the present invention provides a disk failure prediction method based on unbalanced integrated binary classification. It addresses the difficulty of predicting disk failures when the numbers of normal and abnormal samples are unbalanced, and improves imbalanced classification performance by adjusting the data distribution to generate different classification models, thereby improving machine-learning-based disk failure prediction.
The embodiment of the invention provides a disk failure prediction method based on unbalanced integrated binary classification, comprising the following steps:
sampling SMART data of a disk, selecting the state features related to disk failure as an original data set, and obtaining a balanced data set through data partition hybrid sampling;
inputting the original data set and the balanced data set of the disk into an RF algorithm for machine learning, training an original model biased toward the majority class and a local-region strengthening and weakening model respectively, and integrating the two models to obtain a hybrid model biased toward the surrounding boundary;
adaptively selecting among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, and using the resulting classification probability to predict the disk failure state.
In the method, sampling the SMART data of the disk, selecting the state features related to disk failure as the original data set, and obtaining the balanced data set through data partition hybrid sampling comprises the following. Jump analysis and value-range analysis are performed on the disk data, the features that are related to disk failure and have larger entropy are selected, and their values are collected as the original data set $D$. The samples in $D$ labeled normal and failed are first divided into a majority class set $D_{maj}$ and a minority class set $D_{min}$. Four regions are defined and initialized as empty sets: the boundary region $D_{border}$, the minority class noise region $D_{danger-}$, the minority class safe region $D_{safe-}$ and the majority class safe region $D_{safe+}$, where $+$ denotes majority class samples and $-$ denotes minority class samples. The minority class set $D_{min}$, which contains the samples $x_i$, $i = 1, 2, \ldots, N_{min}$, with $N_{min}$ the number of minority class samples, is then traversed. For each minority class sample, its $k$ nearest neighbors are found by the kNN algorithm with $k = 13$, the number $N_{i-}$ of minority class samples among the neighbors is counted, and the majority class samples among the neighbors are stored in the boundary region $D_{border}$. The class ratio of the minority class neighborhood, $t = N_{i-}/k$, is calculated. If $t = 0$, the minority class sample is added to the minority class noise region $D_{danger-}$; if $t \in (0, 1)$, the minority class sample and the majority class samples in its neighborhood are added to the boundary region $D_{border}$; if $t = 1$, the minority class sample is added to the minority class safe region $D_{safe-}$. The remaining samples of the training set $D$ are added to the majority class safe region $D_{safe+}$. Removing the minority class noise region $D_{danger-}$ from $D$ gives the filtered set $D_{filter}$. The number of samples $N_{border}$ in the boundary region $D_{border}$ is counted, comprising $m$ minority class samples and $n$ majority class samples. The minority class samples $x_i$ that lie in the boundary region are denoted $x_{border\_i}$, $i = 1, 2, \ldots, m$, and for each $x_{border\_i}$ the number $N_{i+}$ of majority class samples in its neighborhood is counted. The number of samples to be synthesized in the boundary region is $G = (m + n) \times b - m$, $b \in [0.5, 1]$, where $b$ is the synthesis scale factor; when $b = 1$, the synthesized minority class samples balance the majority class samples, and the number of minority class samples after synthesis equals the original total number of samples. For each boundary minority class sample $x_{border\_i}$, the proportion of majority class samples among its $k$ nearest neighbors is recorded as $r_i = N_{i+}/k$, and a weight is generated as the ratio of this proportion to the sum of the proportions, $w_i = r_i / \sum_{j=1}^{m} r_j$. The synthesis count for each boundary minority class sample is $g_i = w_i \times G$, $i = 1, 2, \ldots, m$. Around each minority class sample in the boundary region, $g_i$ minority class samples are synthesized by the SMOTE method and added to $D_{border}$, giving the boundary region oversampled set $D_{border\_over}$. The majority class safe region $D_{safe+}$ is clustered into a number of clusters determined from $N_{safe+}$, the number of samples in $D_{safe+}$. Each cluster is randomly undersampled, keeping half of its samples, which gives the majority class safe region undersampled set $D_{safe+\_under}$. The minority class noise region $D_{danger-}$ is discarded and the minority class safe region samples $D_{safe-}$ are kept, so that the balanced data set is $D_{balance} = D_{safe-} + D_{safe+\_under} + D_{border\_over}$;
In the method, inputting the original data set and the balanced data set of the disk into the RF algorithm for machine learning, training the original model biased toward the majority class and the local-region strengthening and weakening model, and integrating the two models to obtain the hybrid model biased toward the surrounding boundary comprises the following. The three models are trained with the random forest (RF) algorithm, whose main parameters are: number of decision trees n_estimators = 1000, split criterion criterion = 'gini', minimum number of samples required to split an internal node min_samples_split = 2, and minimum number of samples required at a leaf node min_samples_leaf = 1. The original model $RF_1$ biased toward the majority class is trained on the unbalanced training set $D$, with forest size $s = 100$. The local-region strengthening and weakening model $RF_2$ is trained on the balanced data set $D_{balance}$, with forest size $s = 100$. All base classifiers of $RF_1$ and $RF_2$ are integrated to obtain the hybrid model $RF$ biased toward the surrounding boundary, with forest size $s = 200$ and model ratio $q = 0.5$;
In the method, adaptively selecting among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, and using the resulting classification probability to predict the disk failure state, comprises the following. Given a test set $T$, the nearest neighbors of each test point in the original disk data are found by the kNN algorithm, the number $T_{i+}$ of majority class samples among the neighbors is counted, and the imbalance degree of the test point's neighborhood, the proportion $T_{i+}/k$ of majority class samples among its $k$ neighbors, is calculated. If the imbalance degree is 1, the test point is of the type surrounded entirely by majority class samples, and the original model $RF_1$ biased toward the majority class is selected for prediction; if the test point is of the type with many majority class samples around it, the hybrid model $RF$ biased toward the surrounding boundary is selected; if the test point is of the type with few majority class samples around it, the local-region strengthening and weakening model $RF_2$ is selected. Finally, the decision tree results are combined and the final classification result Label is obtained by hard voting:
$$Label(x) = \arg\max_{y} \sum_{t} I\big(h_t(x) = y\big)$$
where $I(\cdot)$ is the indicator function, $\arg\max$ returns the class $y$ that receives the largest number of votes, $h_t(x)$ is the result of the $t$-th decision tree, $x$ is the test point, and $y$ ranges over the two classes, minority class 0 and majority class 1. A result of 1 indicates that the disk is predicted to be normal and 0 that it is predicted to fail, which gives the predicted state of the disk for the test sample.
[ description of the drawings ]
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of the disk failure prediction method based on unbalanced integrated binary classification according to an embodiment of the present invention;
FIG. 2 is a framework flowchart of the disk failure prediction method based on unbalanced integrated binary classification according to an embodiment of the present invention.
[ detailed description ]
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a disk failure prediction method based on unbalanced integrated binary classification. FIG. 1 is a schematic flowchart of the method; as shown in FIG. 1, the method comprises the following steps:
Step 101: sample the SMART data of the disk, select the state features related to disk failure as the original data set, and obtain a balanced data set through data partition hybrid sampling.
Specifically, the SMART data set used comes from Backblaze. SMART is a disk self-monitoring, analysis and reporting technology, a set of standards established by disk manufacturers; all data that it monitors and records are called SMART data. Jump analysis and value-range analysis are performed on the SMART data of the disk, and the features that are related to disk failure and have larger entropy are selected. The selected SMART attributes are: raw read error rate, start/stop count, reallocated sector count, seek error rate, power-on hours, uncorrectable errors, command timeout, head load/unload count, temperature, current pending sector count, offline uncorrectable sector count, head flying hours / transfer error rate, total LBAs written and total LBAs read. After the disk data are collected, the labels of healthy and failed disks are obtained from the monitoring system, and the labels together with the values of the 14 SMART attributes are collected as the original data set $D$.
The samples in $D$ labeled normal and failed are first divided into a majority class set $D_{maj}$ and a minority class set $D_{min}$. Four regions are defined and initialized as empty sets: the boundary region $D_{border}$, the minority class noise region $D_{danger-}$, the minority class safe region $D_{safe-}$ and the majority class safe region $D_{safe+}$, where $+$ denotes majority class samples and $-$ denotes minority class samples. The minority class set $D_{min}$, which contains the samples $x_i$, $i = 1, 2, \ldots, N_{min}$, with $N_{min}$ the number of minority class samples, is then traversed. For each minority class sample, its $k$ nearest neighbors are found by the kNN algorithm with $k = 13$, the number $N_{i-}$ of minority class samples among the neighbors is counted, and the majority class samples among the neighbors are stored in the boundary region $D_{border}$. The class ratio of the minority class neighborhood, $t = N_{i-}/k$, is calculated. If $t = 0$, the minority class sample is added to the minority class noise region $D_{danger-}$; if $t \in (0, 1)$, the minority class sample and the majority class samples in its neighborhood are added to the boundary region $D_{border}$; if $t = 1$, the minority class sample is added to the minority class safe region $D_{safe-}$. The remaining samples of the training set $D$ are added to the majority class safe region $D_{safe+}$. Removing the minority class noise region $D_{danger-}$ from $D$ gives the filtered set $D_{filter}$.
The number of samples $N_{border}$ in the boundary region $D_{border}$ is counted, comprising $m$ minority class samples and $n$ majority class samples. The minority class samples $x_i$ that lie in the boundary region are denoted $x_{border\_i}$, $i = 1, 2, \ldots, m$, and for each $x_{border\_i}$ the number $N_{i+}$ of majority class samples in its neighborhood is counted. The number of samples to be synthesized in the boundary region is $G = (m + n) \times b - m$, $b \in [0.5, 1]$, where $b$ is the synthesis scale factor; when $b = 1$, the synthesized minority class samples balance the majority class samples, and the number of minority class samples after synthesis equals the original total number of samples. For each boundary minority class sample $x_{border\_i}$, the proportion of majority class samples among its $k$ nearest neighbors is recorded as $r_i = N_{i+}/k$, and a weight is generated as the ratio of this proportion to the sum of the proportions, $w_i = r_i / \sum_{j=1}^{m} r_j$. The synthesis count for each boundary minority class sample is $g_i = w_i \times G$, $i = 1, 2, \ldots, m$. Around each minority class sample in the boundary region, $g_i$ minority class samples are synthesized by the SMOTE method and added to $D_{border}$, giving the boundary region oversampled set $D_{border\_over}$. Algorithm 1 gives the pseudocode of the region partition algorithm of step 101.
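The region partition described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's Algorithm 1: the function names `knn_indices` and `partition_regions` are hypothetical, neighbors are found by brute-force Euclidean distance, the ratio is taken as t = N_{i-}/k, and majority class neighbors are added to the boundary region only for boundary minority samples, which is one possible reading of the text.

```python
import math

def knn_indices(points, idx, k):
    """Indices of the k nearest neighbors of points[idx] (Euclidean), excluding itself."""
    dists = sorted(
        (math.dist(points[idx], p), j) for j, p in enumerate(points) if j != idx
    )
    return [j for _, j in dists[:k]]

def partition_regions(points, labels, k=13, minority=0):
    """Split sample indices into noise, border, minority-safe and majority-safe sets.

    Assumption: t = (minority neighbors) / k decides the region of each
    minority sample: t == 0 -> noise, 0 < t < 1 -> border, t == 1 -> safe.
    """
    noise, border, safe_min = set(), set(), set()
    for i, y in enumerate(labels):
        if y != minority:
            continue
        neigh = knn_indices(points, i, k)
        n_min = sum(1 for j in neigh if labels[j] == minority)  # N_{i-}
        t = n_min / len(neigh)
        if t == 0:
            noise.add(i)
        elif t < 1:
            border.add(i)
            # Majority class neighbors of a boundary sample join the boundary region.
            border.update(j for j in neigh if labels[j] != minority)
        else:
            safe_min.add(i)
    # Remaining majority class samples form the majority class safe region.
    safe_maj = {i for i, y in enumerate(labels) if y != minority} - border
    return noise, border, safe_min, safe_maj
```

With the source's setting the call would use `k=13`; a smaller `k` is convenient for toy data. The brute-force search is quadratic in the number of samples, so a spatial index would be the natural replacement on a data set of realistic size.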
The majority class safe region $D_{safe+}$ is clustered into a number of clusters determined from $N_{safe+}$, the number of samples in $D_{safe+}$. Each cluster is randomly undersampled, keeping half of its samples, which gives the majority class safe region undersampled set $D_{safe+\_under}$. The minority class noise region $D_{danger-}$ is discarded and the minority class safe region samples $D_{safe-}$ are kept, so that the balanced data set is $D_{balance} = D_{safe-} + D_{safe+\_under} + D_{border\_over}$.
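The sampling arithmetic of step 101 can be illustrated with a short Python sketch. The helper names `synthesis_counts` and `undersample_half` are hypothetical, rounding $g_i$ to an integer is an assumption (the text does not say how fractional counts are handled), and the clusters are passed in ready-made because the number of clusters is not recoverable from the text.

```python
import random

def synthesis_counts(majority_neighbor_counts, m, n, k=13, b=1.0):
    """Per-sample synthesis counts g_i = w_i * G for boundary minority samples.

    majority_neighbor_counts[i] is N_{i+}, the number of majority class
    samples among the k neighbors of the i-th boundary minority sample.
    """
    G = (m + n) * b - m                      # total samples to synthesize
    ratios = [c / k for c in majority_neighbor_counts]   # r_i = N_{i+} / k
    total = sum(ratios)
    weights = [r / total for r in ratios]    # w_i = r_i / sum_j r_j
    return [round(w * G) for w in weights]   # rounding is an assumption

def undersample_half(clusters, seed=0):
    """Randomly keep half of the samples of every cluster."""
    rng = random.Random(seed)
    kept = []
    for cluster in clusters:
        kept.extend(rng.sample(cluster, len(cluster) // 2))
    return kept
```

Samples with more majority class neighbors receive larger weights, so more synthetic points are placed where the minority class is most heavily outnumbered.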
Step 102: input the original data set and the balanced data set of the disk into the RF algorithm for machine learning, train an original model biased toward the majority class and a local-region strengthening and weakening model respectively, and integrate the two models to obtain a hybrid model biased toward the surrounding boundary.
Specifically, the original model biased toward the majority class, the local-region strengthening and weakening model, and the hybrid model biased toward the surrounding boundary are trained with the random forest (RF) algorithm. The main parameters of the algorithm are: number of decision trees n_estimators = 1000, split criterion criterion = 'gini', minimum number of samples required to split an internal node min_samples_split = 2, and minimum number of samples required at a leaf node min_samples_leaf = 1. The original model $RF_1$ biased toward the majority class is trained on the unbalanced training set $D$, with forest size $s = 100$. The local-region strengthening and weakening model $RF_2$ is trained on the balanced data set $D_{balance}$, with forest size $s = 100$. All base classifiers of $RF_1$ and $RF_2$ are integrated to obtain the hybrid model $RF$ biased toward the surrounding boundary, with forest size $s = 200$ and model ratio $q = 0.5$.
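The integration of the two forests can be sketched as below. This is only an illustration: base classifiers are represented as plain Python callables rather than trained decision trees, `build_hybrid` is a hypothetical helper, and reading the model ratio $q$ as the share of $RF_1$ trees in the hybrid forest is an assumption. The parameter names in `RF_PARAMS` coincide with those of scikit-learn's `RandomForestClassifier`, although the text does not name the library used.

```python
# Split settings quoted in the text; the names match scikit-learn's
# RandomForestClassifier arguments, but no library is called here.
RF_PARAMS = {"criterion": "gini", "min_samples_split": 2, "min_samples_leaf": 1}

def build_hybrid(rf1_trees, rf2_trees):
    """Pool all base classifiers of RF1 and RF2 into one hybrid forest.

    Returns the pooled list and q, taken here (as an assumption) to be the
    fraction of the hybrid forest that comes from RF1.
    """
    hybrid = list(rf1_trees) + list(rf2_trees)
    q = len(rf1_trees) / len(hybrid)
    return hybrid, q

# Stub base classifiers standing in for trained trees (s = 100 each).
rf1 = [lambda x: 1 for _ in range(100)]   # trained on D (biased to majority)
rf2 = [lambda x: 0 for _ in range(100)]   # trained on D_balance
rf_hybrid, q = build_hybrid(rf1, rf2)     # s = 200, q = 0.5
```

Pooling every tree from both forests reproduces the stated sizes: 100 + 100 = 200 trees, half of them from each model.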
Step 103: adaptively select among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, and use the resulting classification probability to predict the disk failure state.
Specifically, given a disk test data set $T$, the nearest neighbors of each test point in the original disk data are found by the kNN algorithm, the number $T_{i+}$ of majority class samples among the neighbors is counted, and the imbalance degree of the test point's neighborhood, the proportion $T_{i+}/k$ of majority class samples among its $k$ neighbors, is calculated. If the imbalance degree is 1, the test point is of the type surrounded entirely by majority class samples, and the original model $RF_1$ biased toward the majority class is selected for prediction; if the test point is of the type with many majority class samples around it, the hybrid model $RF$ biased toward the surrounding boundary is selected; if the test point is of the type with few majority class samples around it, the local-region strengthening and weakening model $RF_2$ is selected. Finally, the decision tree results are combined and the final classification result Label is obtained by hard voting:
$$Label(x) = \arg\max_{y} \sum_{t} I\big(h_t(x) = y\big)$$
where $I(\cdot)$ is the indicator function, $\arg\max$ returns the class $y$ that receives the largest number of votes, $h_t(x)$ is the result of the $t$-th decision tree, $x$ is the test point, and $y$ ranges over the two classes, minority class 0 and majority class 1. A result of 1 indicates that the disk is predicted to be normal and 0 that it is predicted to fail, which gives the predicted state of the disk for the test sample.
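Hard voting over the decision trees amounts to counting, for each class, how many trees predict it and returning the class with the most votes. A minimal Python sketch follows; `hard_vote` is a hypothetical name, trees are modeled as callables, and breaking ties in favor of the first listed class is an assumption.

```python
from collections import Counter

def hard_vote(trees, x, classes=(0, 1)):
    """Return argmax over y of sum_t I(h_t(x) == y).

    trees: iterable of callables h_t, each returning a class label for x.
    Ties go to the class listed first in `classes` (an assumption).
    """
    votes = Counter(tree(x) for tree in trees)
    return max(classes, key=lambda y: votes[y])

# Three stub trees: two vote "normal" (1), one votes "failure" (0).
stub_trees = [lambda x: 1, lambda x: 0, lambda x: 1]
label = hard_vote(stub_trees, x=None)
```

Here `label` is 1, because class 1 receives two of the three votes.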
Algorithm 2 gives the pseudocode of the dynamic model selection of step 103.
Algorithm 3 gives the pseudocode of the disk failure prediction method based on unbalanced integrated binary classification.
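The dynamic selection can be sketched as a function of the majority class count among a test point's neighbors. This is not the patent's Algorithm 2: `select_model` is a hypothetical name, and the cut-off between "many" and "few" majority class neighbors is not recoverable from the text, so the `threshold` parameter and its default of 0.5 are purely illustrative assumptions.

```python
def select_model(majority_neighbors, k, threshold=0.5):
    """Choose a model name from the neighborhood imbalance degree T_{i+} / k.

    threshold separates "many" from "few" majority class neighbors; its
    value is an assumption, since the source conditions are missing.
    """
    degree = majority_neighbors / k
    if degree == 1:
        return "RF1"        # surrounded entirely by majority class samples
    if degree >= threshold:
        return "RF_hybrid"  # many majority class samples nearby
    return "RF2"            # few majority class samples nearby
```

For example, with `k=13` a test point with 13 majority class neighbors is routed to `RF1`, one with 9 to the hybrid model, and one with 2 to `RF2`.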
Table 1 describes the public data sets to which the disk failure prediction method based on unbalanced integrated binary classification is applied, giving for each data set the number of features, the data distribution (number of majority class samples and number of minority class samples) and the imbalance rate (ratio of the number of majority class samples to the number of minority class samples). Table 2 lists the SMART attributes selected from the disk data.
Table 1
Table 2
Table 3 gives the comparative F-measure results (the F-measure being the harmonic mean of minority class recall and precision) obtained when the disk failure prediction method based on unbalanced integrated binary classification is applied to the classification of 10 public data sets and to disk failure prediction. The comparison methods in the embodiment are six typical methods for the unbalanced binary classification problem: RUSBoost, SMOTEBoost, EasyEnsemble, BalancedBagging, BRAF and DTE-SBD. Table 3 shows that the proposed method, DPHS-MDS, achieves a markedly higher F-measure than the comparison methods on both the public data sets and the disk data set. In particular, its average result over the 10 public data sets and the disk data set is markedly higher, which indicates a clear improvement in disk failure prediction performance.
Table 3
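For reference, the F-measure used in the comparison can be computed as in the following Python sketch. The function name `f_measure` is hypothetical, and the minority class is taken to be label 0, following the class coding given earlier.

```python
def f_measure(y_true, y_pred, positive=0):
    """Harmonic mean of precision and recall for the `positive` (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct minority predictions: precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two failed disks (0) and three healthy disks (1); one failure is missed
# and one healthy disk is flagged by mistake.
score = f_measure([0, 0, 1, 1, 1], [0, 1, 0, 1, 1])
```

In this toy case precision and recall are both 0.5, so the F-measure is 0.5. Because only the minority class enters the computation, a classifier that always predicts "normal" scores 0 regardless of its overall accuracy.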
In summary, the embodiments of the present invention have the following beneficial effects:
In the technical solution of the embodiment of the invention, SMART data of a disk is sampled, the state features related to disk failure are selected as an original data set, and a balanced data set is obtained through data partition hybrid sampling; the original data set and the balanced data set of the disk are input into an RF algorithm for machine learning, an original model biased toward the majority class and a local-region strengthening and weakening model are trained respectively, and the two models are integrated to obtain a hybrid model biased toward the surrounding boundary; the three models are selected among adaptively according to the imbalance degree of the nearest neighbors in the original disk data set, and the resulting classification probability is used to predict the disk failure state. The technical solution addresses the difficulty of predicting disk failures when the numbers of normal and abnormal samples are unbalanced, generates different classification models by adjusting the data distribution so as to improve imbalanced classification performance, and improves machine-learning-based disk failure prediction.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (4)
1. A disk failure prediction method based on unbalanced integrated binary classification is characterized by comprising the following steps:
(1) sampling SMART data of a disk, selecting the state features related to disk failure as an original data set, and obtaining a balanced data set through data partition hybrid sampling;
(2) inputting the original data set and the balanced data set of the disk into an RF algorithm for machine learning, training an original model biased toward the majority class and a local-region strengthening and weakening model respectively, and integrating the two models to obtain a hybrid model biased toward the surrounding boundary;
(3) adaptively selecting among the three models according to the imbalance degree of the nearest neighbors in the original disk data set, and using the resulting classification probability to predict the disk failure state.
2. The method according to claim 1, wherein sampling the SMART data of the disk, selecting the state features related to disk failure as the original data set, and obtaining the balanced data set through data partition hybrid sampling is specifically as follows: jump analysis and value-range analysis are performed on the disk data, the features that are related to disk failure and have larger entropy are selected, and their values are collected as the original data set $D$; the samples in $D$ labeled normal and failed are first divided into a majority class set $D_{maj}$ and a minority class set $D_{min}$; the boundary region $D_{border}$, the minority class noise region $D_{danger-}$, the minority class safe region $D_{safe-}$ and the majority class safe region $D_{safe+}$ are defined and initialized as empty sets, where $+$ denotes majority class samples and $-$ denotes minority class samples; the minority class set $D_{min}$, which contains the samples $x_i$, $i = 1, 2, \ldots, N_{min}$, with $N_{min}$ the number of minority class samples, is then traversed; for each minority class sample, its $k$ nearest neighbors are found by the kNN algorithm with $k = 13$, the number $N_{i-}$ of minority class samples among the neighbors is counted, and the majority class samples among the neighbors are stored in the boundary region $D_{border}$; the class ratio of the minority class neighborhood, $t = N_{i-}/k$, is calculated; if $t = 0$, the minority class sample is added to the minority class noise region $D_{danger-}$; if $t \in (0, 1)$, the minority class sample and the majority class samples in its neighborhood are added to the boundary region $D_{border}$; if $t = 1$, the minority class sample is added to the minority class safe region $D_{safe-}$; the remaining samples of the training set $D$ are added to the majority class safe region $D_{safe+}$; removing the minority class noise region $D_{danger-}$ from $D$ gives the filtered set $D_{filter}$; the number of samples $N_{border}$ in the boundary region $D_{border}$ is counted, comprising $m$ minority class samples and $n$ majority class samples; the minority class samples $x_i$ that lie in the boundary region are denoted $x_{border\_i}$, $i = 1, 2, \ldots, m$, and for each $x_{border\_i}$ the number $N_{i+}$ of majority class samples in its neighborhood is counted; the number of samples to be synthesized in the boundary region is $G = (m + n) \times b - m$, $b \in [0.5, 1]$, where $b$ is the synthesis scale factor, and when $b = 1$ the synthesized minority class samples balance the majority class samples and the number of minority class samples after synthesis equals the original total number of samples; for each boundary minority class sample $x_{border\_i}$, the proportion of majority class samples among its $k$ nearest neighbors is recorded as $r_i = N_{i+}/k$, and a weight is generated as the ratio of this proportion to the sum of the proportions, $w_i = r_i / \sum_{j=1}^{m} r_j$; the synthesis count for each boundary minority class sample is $g_i = w_i \times G$, $i = 1, 2, \ldots, m$; around each minority class sample in the boundary region, $g_i$ minority class samples are synthesized by the SMOTE method and added to $D_{border}$, giving the boundary region oversampled set $D_{border\_over}$; the majority class safe region $D_{safe+}$ is clustered into a number of clusters determined from $N_{safe+}$, the number of samples in $D_{safe+}$; each cluster is randomly undersampled, keeping half of its samples, which gives the majority class safe region undersampled set $D_{safe+\_under}$; the minority class noise region $D_{danger-}$ is discarded and the minority class safe region samples $D_{safe-}$ are kept, so that the balanced data set is $D_{balance} = D_{safe-} + D_{safe+\_under} + D_{border\_over}$.
3. The method of claim 1, wherein the original data set and the balanced data set of the disk are input into an RF algorithm for machine learning, an original model biased toward the majority class and a local-region strengthening-and-weakening model are trained respectively, and the two models are integrated to obtain a hybrid model biased toward the surrounding boundary, specifically: the original model biased toward the majority class, the local-region strengthening-and-weakening model, and the hybrid model biased toward the surrounding boundary are trained with the random forest (RF) algorithm, whose main parameters are the number of decision trees n_estimators = 1000, the decision tree split criterion criterion = 'gini', the minimum number of samples required to split an internal node min_samples_split = 2, and the minimum number of samples required at a leaf node min_samples_leaf = 1; the original model $RF_1$ biased toward the majority class is obtained by training on the unbalanced training set $D$, with forest size $s = 100$; the local-region strengthening-and-weakening model $RF_2$ is obtained by training on the balanced data set $D_{balance}$, with forest size $s = 100$; all base classifiers of the original model $RF_1$ and of the local-region strengthening-and-weakening model $RF_2$ are integrated to yield the hybrid model $RF$ biased toward the surrounding boundary, with forest size $s = 200$ and model ratio $q = 0.5$.
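The integration step of claim 3 amounts to pooling the base classifiers of two forests. The sketch below shows one way to do that in plain Python. It assumes that the model ratio q is the share of the hybrid forest drawn from the first model, which the claim does not spell out. The name `build_hybrid` and the `RF_PARAMS` dictionary are hypothetical; the dictionary keys match keyword arguments of scikit-learn's `RandomForestClassifier`, whose fitted trees are exposed through its `estimators_` attribute. The tree count is passed separately because the claim gives both n_estimators = 1000 and a forest size of 100 per model.

```python
from typing import Callable, List, Sequence

# A base classifier is modeled as any callable mapping a feature vector to a label.
Tree = Callable[[Sequence[float]], int]

# Split parameters listed in the claim (forest size is handled separately).
RF_PARAMS = {
    "criterion": "gini",
    "min_samples_split": 2,
    "min_samples_leaf": 1,
}

def build_hybrid(rf1_trees: List[Tree], rf2_trees: List[Tree],
                 s: int = 200, q: float = 0.5) -> List[Tree]:
    """Pool base classifiers from two forests into one hybrid forest.

    Assumption: q is the fraction of the s hybrid trees taken from rf1_trees;
    the remainder comes from rf2_trees.
    """
    n1 = round(s * q)
    n2 = s - n1
    if n1 > len(rf1_trees) or n2 > len(rf2_trees):
        raise ValueError("not enough base classifiers for the requested size")
    return list(rf1_trees[:n1]) + list(rf2_trees[:n2])
```

With the claimed values s = 200 and q = 0.5, and two forests of 100 trees each, the hybrid contains every base classifier of both models.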
4. The method of claim 1, wherein the three models are adaptively selected according to the degree of imbalance among the nearest neighbors of each test point in the original disk data set, and the resulting classification probabilities are used to predict the disk failure state, specifically as follows: given a test set $T$, the neighbors of each test point in the original disk data are found by the kNN algorithm, the number $T_{i+}$ of majority class samples among the neighbors is counted, and the imbalance degree of the neighborhood of the test point is calculated (the formula is missing from the source). If the imbalance degree is 1, the test point is classed as surrounded entirely by the majority class, and the original model $RF_1$ biased toward the majority class is selected for prediction; if the imbalance degree falls in a second range (condition missing from the source), the test point is classed as surrounded mostly by the majority class, and the hybrid model $RF$ biased toward the surrounding boundary is selected for prediction; if it falls in a third range (condition missing from the source), the test point is classed as surrounded by few majority class samples, and the local-region strengthening-and-weakening model $RF_2$ is selected for prediction. Finally, the decision tree results of the models are integrated, and the final classification result Label is obtained by hard voting:
$$\mathrm{Label}(x) = \arg\max_{y} \sum_{t} I\left(h_t(x) = y\right)$$
where $I(\cdot)$ is the indicator function, $\arg\max_{y}$ gives the predicted class of the test sample, that is, the class for which the vote count is largest, $h_t(x)$ is the result of the $t$-th decision tree, $x$ is a test point, and $y$ ranges over the two classes, the minority class 0 and the majority class 1; 1 indicates that the disk is predicted to be normal with high probability and 0 indicates that the disk is predicted to fail with high probability, so that the predicted disk state of the test sample is obtained.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911339988.3A CN111091201A (en) | 2019-12-23 | 2019-12-23 | Data partition mixed sampling-based unbalanced integrated classification method |
CN2019113399883 | 2019-12-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112465153A (en) | 2021-03-09 |
Family
ID=70395790
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911339988.3A Pending CN111091201A (en) | 2019-12-23 | 2019-12-23 | Data partition mixed sampling-based unbalanced integrated classification method |
CN202011510541.0A Pending CN112465153A (en) | 2019-12-23 | 2020-12-18 | Disk fault prediction method based on unbalanced integrated binary classification |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911339988.3A Pending CN111091201A (en) | 2019-12-23 | 2019-12-23 | Data partition mixed sampling-based unbalanced integrated classification method |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN111091201A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113434401A (en) * | 2021-06-24 | 2021-09-24 | 杭州电子科技大学 | Software defect prediction method based on sample distribution characteristics and SPY algorithm |
CN113591896A (en) * | 2021-05-18 | 2021-11-02 | 广西电网有限责任公司电力科学研究院 | Power grid attack event classification detection method |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111091201A (en) * | 2019-12-23 | 2020-05-01 | 北京邮电大学 | Data partition mixed sampling-based unbalanced integrated classification method |
CN112364706A (en) * | 2020-10-19 | 2021-02-12 | 燕山大学 | Small sample bearing fault diagnosis method based on class imbalance |
CN112365060B (en) * | 2020-11-13 | 2024-01-26 | 广东电力信息科技有限公司 | Preprocessing method for network Internet of things sensing data |
CN112508243B (en) * | 2020-11-25 | 2022-09-09 | 国网浙江省电力有限公司信息通信分公司 | Training method and device for multi-fault prediction network model of power information system |
CN112800917B (en) * | 2021-01-21 | 2022-07-19 | 华北电力大学(保定) | Circuit breaker unbalance monitoring data set oversampling method |
CN112836735B (en) * | 2021-01-27 | 2023-09-01 | 中山大学 | Method for processing unbalanced data set by optimized random forest |
CN112633426B (en) * | 2021-03-11 | 2021-06-15 | 腾讯科技(深圳)有限公司 | Method and device for processing data class imbalance, electronic equipment and storage medium |
CN114612255B (en) * | 2022-04-08 | 2023-11-07 | 湖南提奥医疗科技有限公司 | Insurance pricing method based on electronic medical record data feature selection |
CN114969669B (en) * | 2022-07-27 | 2022-11-15 | 深圳前海环融联易信息科技服务有限公司 | Data balance degree processing method, joint modeling system, device and medium |
CN115374858B (en) * | 2022-08-24 | 2024-05-14 | 东北大学 | Intelligent diagnosis method for flow industrial production quality based on hybrid integrated model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109359704A (en) * | 2018-12-26 | 2019-02-19 | 北京邮电大学 | A kind of more classification methods integrated based on adaptive equalization with dynamic layered decision |
CN111091201A (en) * | 2019-12-23 | 2020-05-01 | 北京邮电大学 | Data partition mixed sampling-based unbalanced integrated classification method |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113591896A (en) * | 2021-05-18 | 2021-11-02 | 广西电网有限责任公司电力科学研究院 | Power grid attack event classification detection method |
CN113434401A (en) * | 2021-06-24 | 2021-09-24 | 杭州电子科技大学 | Software defect prediction method based on sample distribution characteristics and SPY algorithm |
CN113434401B (en) * | 2021-06-24 | 2022-10-28 | 杭州电子科技大学 | Software defect prediction method based on sample distribution characteristics and SPY algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN111091201A (en) | 2020-05-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112465153A (en) | | Disk fault prediction method based on unbalanced integrated binary classification |
CN108986869B (en) | | Disk fault detection method using multi-model prediction |
CN110703057B (en) | | Power equipment partial discharge diagnosis method based on data enhancement and neural network |
CN104503874A (en) | | Hard disk failure prediction method for cloud computing platform |
Chien et al. | | A system for online detection and classification of wafer bin map defect patterns for manufacturing intelligence |
Yu et al. | | Pareto-optimal adaptive loss residual shrinkage network for imbalanced fault diagnostics of machines |
CN106682688A (en) | | Pile-up noise reduction own coding network bearing fault diagnosis method based on particle swarm optimization |
CN107168995B (en) | | Data processing method and server |
CN105760889A (en) | | Efficient imbalanced data set classification method |
CN112951311B (en) | | Hard disk fault prediction method and system based on variable weight random forest |
CN112214369A (en) | | Hard disk fault prediction model establishing method based on model fusion and application thereof |
CN103941131A (en) | | Transformer fault detecting method based on simplified set unbalanced SVM (support vector machine) |
CN111881289B (en) | | Training method of classification model, and detection method and device of data risk class |
CN112365060B (en) | | Preprocessing method for network Internet of things sensing data |
KR102144010B1 (en) | | Methods and apparatuses for processing data based on representation model for unbalanced data |
CN116582300A (en) | | Network traffic classification method and device based on machine learning |
Bhat et al. | | An empirical evaluation of defect prediction approaches in within-project and cross-project context |
CN112699936B (en) | | Electric power CPS generalized false data injection attack identification method |
CN110673997B (en) | | Disk failure prediction method and device |
CN110991241B (en) | | Abnormality recognition method, apparatus, and computer-readable medium |
CN115438239A (en) | | Abnormity detection method and device for automatic abnormal sample screening |
Becker et al. | | Rough set theory in the classification of loan applications |
CN115545111B (en) | | Network intrusion detection method and system based on clustering self-adaptive mixed sampling |
Zhihao et al. | | Comparison of the different sampling techniques for imbalanced classification problems in machine learning |
CN114756420A (en) | | Fault prediction method and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210309 |